Argh. It looks like a lot of the variability (and the optimistic figures) below 
is caused by my looking at all requests rather than local requests only. 
Local requests are obviously vastly fewer in number... but they give 
full-request times.

These are my current times from the stats page:

Successful      40.870s
Unsuccessful    11.796s
Average         12.106s

However these seem to be even more variable than the remote averages, probably 
because different nodes fetch different content, some of which is harder to 
find and takes more hops...

E.g. nextgens' node shows 26 / 10 / 25 and TheSeeker's 30 / 11 / 14 for the 
same three figures.

So it looks like we'll have to use both sets of figures ...

Recomputing the turtling table below based on my local requests over the same 
period (midnight to 11:00):

$ zgrep "Successful CHK fetch took" fast/logs-dark/freenet-1197-2009-01-14-0* 
fast/logs-dark/freenet-1197-2009-01-14-10-* | sed "s/^.*Successful CHK fetch 
took //" > local-only-timings.list

Cutoff  % reduction in mean request time                % of requests turtled
15s             84%                                                     59%
30s             73%                                                     37%
45s             65%                                                     22%
60s             59%                                                     16%
90s             51%                                                     7.6%
120s            46%                                                     4.4%

On the other hand, on average a remote request passing through a given node 
must be about half way through its route ... so we could reasonably just 
double the remote timings, if my node is a representative sample of the 
requests on the overall network: on the theory that the likelihood of hitting 
a really slow node doubles with twice the number of hops, and that the search 
time is also proportional to the number of hops.
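
Concretely, the table below is just the remote table from the quoted mail with 
each cutoff doubled. Since both percentage columns are unchanged by scaling, 
the same numbers can be reproduced by doubling the samples and rerunning the 
earlier one-liner on the result, e.g. (a sketch; timings2.list as produced by 
the commands in the quoted mail, times in milliseconds):

$ awk '{ print $1 * 2 }' timings2.list > doubled-timings.list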

Cutoff  % reduction in mean request time                % of requests turtled
30s             74-76%                                                  41-42%
60s             56-59%                                                  22-23%
90s             42-48%                                                  12-14%
120s            31-39%                                                  7-9%

These are actually reasonably similar, especially in the 30 and 60 second 
slots... Both show about a 50% gain if we turtle anything above 90 seconds...

Obviously if we implement turtling we should show the proportion of local (and 
remote) requests that get a transfer failure, and if possible the proportion 
that are offered the key afterwards...

So the proposed course of action:

Turtle anything over 90 seconds. Show more stats on the web interface, which 
can be disabled with a config option. Show the probability of a transfer 
failure, and the proportion of transfer failures that result in the key being 
offered and fetched, or found some other way, within say 20 minutes.

One problem with the above tables is that they drop the samples over the 
threshold entirely, rather than capping them at the threshold (a turtled 
request still takes the full threshold time before it is turtled) ...

Correcting the local timings table for this:

Cutoff  % reduction in mean request time                % of requests turtled
30s             45%                                                     37%
60s             25%                                                     16%
90s             15%                                                     7.6%
120s            10%                                                     4.4%
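
The capped version of the earlier one-liner would look something like this 
(again a sketch under the same assumptions; samples over the cutoff are 
counted at the cutoff instead of being dropped):

$ for c in 30 60 90 120; do
    awk -v cutoff=$((c*1000)) '
      # cap each sample at the cutoff; count how many would have been turtled
      { n++; sum += $1;
        capped += ($1 > cutoff ? cutoff : $1);
        if ($1 > cutoff) turtled++ }
      END { printf "%ds\t%.0f%% reduction\t%.1f%% turtled\n",
            cutoff/1000, 100*(1 - capped/sum), 100*turtled/n }
    ' local-only-timings.list
  done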

The doubled remote timings table, corrected in the same way:

Cutoff  % reduction in mean request time                % of requests turtled
30s             59%                                                     41-42%
60s             40%                                                     22-23%
90s             28%                                                     12-14%
120s            9%                                                      7-9%

Hence if we use 30 seconds we should get a 45-59% gain, but at the cost of 
turtling around 40% of requests; at 60 seconds, the gain may not be large 
enough to be easily detectable. BUT we don't have to abandon a transfer after 
a set number of seconds: we could instead switch it to turtle mode once a 
single block takes more than N seconds to transfer. If we set this to, say, 3 
seconds, we should see no non-turtle transfers over 90 seconds, and most 
taking much less than that; a reduction in mean time on the order of 50%; and 
hopefully not too many requests turtled. Probably worth trying ... it may make 
sense to do some load management changes first...

On Wednesday 14 January 2009 13:57, Matthew Toseland wrote:
> At the moment, having taken out the recent latency optimisation changes (that
> resulted in a massive cut in bandwidth usage), latency is way up:
> - Median CHK request time 11.2 seconds.
> - Mean 22-23 seconds.
> - 41-42% of requests take more than 15 seconds to complete.
> - 22-23% of requests take more than 30 seconds to complete.
> - 7-9% of requests take more than 60 seconds to complete.
> 
> These figures are based on a sample of approx 11 hours overnight, after it 
> became mandatory (may include some UOM), and a sample of half an hour around 
> 12ish. The two agree very closely. TheSeeker's node shows a 13 second median 
> and a 27 second mean. You can get similar results by setting log level 
> details to freenet.node.RequestSender:MINOR, then:
> 
> Just follow the internally updated median/mean:
> $ tail --follow=name --retry fast/logs-dark/freenet-latest.log | 
> grep "Successful CHK request took"
> 
> Grep for individual timings:
> $ zgrep "Successful CHK request took " \
>     fast/logs-dark/freenet-1197-2009-01-14-0* \
>     fast/logs-dark/freenet-1197-2009-01-14-10-* \
>     | sed "s/^.*Successful CHK request took //" \
>     | sed "s/ average Median.*$//" > timings2.list
> 
> Sort them and view them in less to get percentiles etc:
> $ cat timings.list | sort -n | less
> (Use the -N option to show line numbers)
> 
> Get mean excluding outliers over some value:
> $ cat timings.list | (total=0; count=0; \
>     while read x; do \
>       if test $x -gt 30000; then echo Over 30 seconds: $x; \
>       else count=$((count+1)); total=$((total+x)); fi; \
>     done; \
>     echo Total is $total count is $count average is $(( $total / $count )))
> 
> 
> Yesterday (1196, transfer backoff and Saturday's throttling), these stats were
> a 4 second median and 8 second mean. The 90th percentile was 15-17 seconds
> yesterday and is 50-57 seconds today.
> 
> However on Tuesday (1195, Saturday's throttling but not transfer backoff), it
> was more like a 3 second median and a mean fluctuating a lot due to some high
> values every now and then, around 13 seconds later on when there was more
> data. Of course there are time of day effects. :|
> 
> The main result of yesterday's testing (transfer backoff on transfers taking 
> more than 15 seconds) was that there was a vast amount of backoff, and even 
> lower bandwidth usage than tuesday, presumably because lots of nodes are 
> affected by a single slow transfer. Users reported less than half their 
> backoff was due to transfer backoff, otoh ... it was over half for me for a 
> while, but it reduced as a proportion over the day.
> 
> We could cut the average CHK request time significantly at the cost of a 
> somewhat smaller proportion of requests failing at a given threshold and 
> having to continue on the last hop only as a turtle-request; when the 
> transfer completes, we would offer it to the nodes that have asked for it in 
> the past.
> 
> Cutoff        % reduction in mean request time                % of requests turtled
> 15s           74-76%                                                  41-42%
> 30s           56-59%                                                  22-23%
> 45s           42-48%                                                  12-14%
> 60s           31-39%                                                  7-9%
> 
> Obviously whatever proportion of requests are turtled, the fproxy psuccess is
> likely to be reduced by that much. :| OTOH it shouldn't affect queued
> requests much.
> 
> IMHO the system is over-optimised for throughput at the moment. The fact that
> the mean didn't decrease on Tuesday (although some users are seeing much
> higher figures than the above quotes, probably transient though) is probably
> due to outliers perhaps related to the significant backoff resulting from the
> over-aggressive solutions I have tried so far. With Saturday's limiting
> turned off, the main limiter on the number of requests a node accepts is
> output bandwidth liability limiting, which works on the principle of assuming
> that every request in flight will succeed, and working out how many can be
> accepted if they must all complete in 90 seconds. We could probably reduce
> this to 60 without a significant adverse effect on bandwidth usage.
> Saturday's limiting works similarly but uses the average bytes used for a
> request i.e. it takes the short-term psuccess into account. It has a much
> shorter threshold (5 seconds), and doesn't try to compensate for overheads.
> It might be interesting to reinstate this with a much higher threshold (20
> seconds??). Hopefully the combination would make the above table more
> attractive: if the last column's values could be halved, for example, without
> severely impacting on bandwidth usage, the combination would be very
> attractive. IMHO turtling support (or at least much stricter transfer
> timeouts) is necessary for reasons of attack resistance; and the current
> proposal (in a previous mail) incorporates the best part of Ian's transfer
> backoff without flooding the network with backoff.
> 
> A last resort would be a bulk vs realtime flag on requests. Bulk requests
> could be handled separately from real-time requests. Real-time requests would
> have a higher transfer priority, but would be limited to some proportion of
> the overall bandwidth usage, would only tolerate fast transfers, and in
> future might be routed more quickly / queued for less time (and therefore to
> a less ideal target). Bulk requests would be optimised for psuccess primarily
> and then for throughput, tolerating reasonably long transfer times (but not
> the 48 minutes theoretically possible now!). This has been suggested in the
> past, obviously it costs us some request indistinguishability, but maybe the
> time for it is soon. Anyway a proper proposal would need to be fleshed out.
> Arguably ULPRs obsolete bulk requests.
> 