On Tuesday 30 Aug 2011 18:32:17 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 6:42 PM, Matthew Toseland <toad at
> amphibian.dyndns.org> wrote:
> 
> > On Monday 29 Aug 2011 18:58:26 Ian Clarke wrote:
> > > On Mon, Aug 29, 2011 at 12:37 PM, Matthew Toseland <
> > > Right, the same is true of queueing.  If nodes are forced to do things to
> > > deal with overloading that make the problem worse then the load balancing
> > > algorithm has failed.  Its job is to prevent that from happening.
> >
> > Not true. Queueing does not make anything worse (for bulk requests where we
> > are not latency sensitive). **When a request is waiting for progress on a
> > queue, it is not using any bandwidth!**
> >
> 
> I thought there was some issue where outstanding requests occupied "slots"
> or something?

Slots are arbitrary, a tool for estimating usage in order to avoid timeouts; 
the real resource is bandwidth. If we end up using less than the limit because 
we are waiting for slots to free, then the simplest fix is to add more slots 
(and increase the timeouts if necessary).
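
To make that concrete, here is a minimal sketch (not the actual Freenet code; 
the class and parameter names are hypothetical) of the idea that a slot count 
is just a derived estimate, with bandwidth times the acceptable time window as 
the real budget:

    // Hypothetical sketch: slots derived from the real constraint, bandwidth * window.
    class SlotEstimator {
        private final int outputBytesPerSecond;    // node's output bandwidth limit
        private final int worstCaseTransferBytes;  // worst-case bytes one request can cost us
        private final int windowSeconds;           // "time to transfer everything in the worst case"

        SlotEstimator(int outputBytesPerSecond, int worstCaseTransferBytes, int windowSeconds) {
            this.outputBytesPerSecond = outputBytesPerSecond;
            this.worstCaseTransferBytes = worstCaseTransferBytes;
            this.windowSeconds = windowSeconds;
        }

        // Maximum requests in flight such that, even if every one of them needs a
        // worst-case transfer, we can still push it all out within the window.
        int maxSlots() {
            long budgetBytes = (long) outputBytesPerSecond * windowSeconds;
            return (int) (budgetBytes / worstCaseTransferBytes);
        }
    }

In this framing, widening the window (e.g. the 120 to 240 change proposed 
further down) doubles the slot count without touching the bandwidth limit, and 
"just add more slots" is exactly the fix when we are under-using bandwidth 
while waiting for slots.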
> 
> Regardless, even if queueing doesn't use additional bandwidth or CPU
> resources, it also doesn't use any less of these resources - so it doesn't
> actually help to alleviate any load (unless it results in a timeout in which
> case it uses more of everything).

Agree that timeouts are bad.

The basic function of queueing is to match incoming requests more closely to 
the rate at which we can forward them.
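
As a toy illustration of that point (a sketch only, with my own made-up names, 
not the real code): a queued request costs nothing but a small bookkeeping 
entry while it waits for an outgoing slot, and only consumes bandwidth once it 
is actually forwarded.

    import java.util.concurrent.Semaphore;

    // Hypothetical sketch: park incoming requests until an outgoing slot frees,
    // instead of rejecting them (which forces the sender to misroute).
    class ForwardingGate {
        private final Semaphore outgoingSlots;

        ForwardingGate(int slots) {
            // Fair semaphore, so waiting requests are served roughly in arrival order.
            this.outgoingSlots = new Semaphore(slots, true);
        }

        void accept(Runnable forward) throws InterruptedException {
            outgoingSlots.acquire();   // waiting here uses no bandwidth at all
            try {
                forward.run();         // only now does the request transfer data
            } finally {
                outgoingSlots.release();
            }
        }
    }
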
> 
> And it does use more of one very important resource, which is the initial
> requestor's time.  I mean, ultimately the symptom of overloading is that
> requests take longer, and queueing makes that problem worse.

For realtime requests, latency matters. For bulk requests (e.g. big downloads), 
it doesn't; throughput does. The time it takes to transfer 100MB of data is 
determined by the time it takes to start each request, the number that can run 
in parallel, and the time each one takes to run. You are saying that only the 
last variable matters, when in fact it is largely irrelevant provided the other 
two scale to compensate.
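
A back-of-envelope sketch with made-up but plausible numbers shows why: 
steady-state throughput is parallelism times block size divided by per-request 
time, so quadrupling the per-request time while quadrupling the number in 
flight leaves throughput unchanged.

    // Illustrative numbers only.
    class BulkThroughput {
        // bytes per second = requests in flight * block size / seconds per request
        static double throughput(int inFlight, int blockBytes, double secondsPerRequest) {
            return inFlight * (double) blockBytes / secondsPerRequest;
        }

        public static void main(String[] args) {
            int block = 32 * 1024; // one CHK data block
            System.out.println(throughput(50, block, 18.0));  // ~91,000 bytes/s
            System.out.println(throughput(200, block, 72.0)); // ~91,000 bytes/s: 4x slower requests, same throughput
        }
    }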

More important still, however, is routing accuracy. If requests go more hops, 
we waste more resources, as ArneBab has pointed out.
> 
> Queueing should be a last resort, the *right* load balancing algorithm
> should avoid situations where queueing must occur.

Substitute "misrouting" for "queueing". Queueing is neutral: It is an annoyance 
at worst, and if it enables greater efficiency at the same or greater 
throughput, it's worth it. Whereas misrouting causes a serious aggravation of 
the underlying problem (bandwidth usage, *the* scarce resource here) by 
transferring the same data over more hops. Except for those requests where we 
are explicitly targeting low latency - realtime fproxy requests - which can be 
treated differently, and are usually for very popular keys anyway.

The point is, NLM achieves good routing accuracy in practice as well as in 
theory. However, throughput is greatly reduced, mainly because:
1) There is insufficient feedback to the request originators, and
2) Requests take longer (with AIMDs enabled, a bit over 1 minute vs more like 
18 seconds).

We can solve both problems relatively easily:
- Increase the number of allowed requests by increasing the bandwidth 
limiter's time-to-transfer-everything-in-worst-case parameter, from 120 to say 
240 (in the terms of the slot sketch above, doubling the window doubles the 
slots). Since we have some spare bandwidth to play with when NLM is enabled, 
this will increase throughput without significantly increasing transfer times. 
It won't directly increase queueing times (because twice the number of requests 
are chasing twice the number of slots), at least not until we reach the point 
where it affects transfer times.
- Keep AIMDs, so that the originators slow down when there is a problem.
- Make AIMDs more effective by sending slow-down messages when we are 
approaching our capacity, not only when we are past it and rejecting requests 
(i.e. forcing misrouting). A minimal sketch of this follows the list.
- Keep fair sharing and improve it. I have shown in my other mail that this is 
necessary regardless.
- Optionally, separate load management for SSKs from that for CHKs. Queueing 
times, according to ArneBab's maths, are dependent on the transfer time, which 
is much, much higher for CHKs than for SSKs. Therefore SSKs, with their very 
low success rates (largely due to polling by messaging apps), have to wait much 
longer because of the CHKs (which mostly succeed).
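
On the slow-down messages, the sketch promised above: the idea is simply to 
signal originators before we hit the wall. This is a hypothetical threshold 
check with made-up names and an assumed 80% trigger point, not the actual 
accept/reject code:

    // Hypothetical sketch: warn originators at e.g. 80% of capacity instead of
    // only rejecting (and thereby forcing misrouting) at 100%.
    class LoadSignaller {
        private static final double SLOW_DOWN_THRESHOLD = 0.8; // assumed value

        enum Decision { ACCEPT, ACCEPT_BUT_SLOW_DOWN, REJECT }

        Decision onIncomingRequest(double usedCapacity, double totalCapacity) {
            double fraction = usedCapacity / totalCapacity;
            if (fraction >= 1.0) return Decision.REJECT;                            // last resort
            if (fraction >= SLOW_DOWN_THRESHOLD) return Decision.ACCEPT_BUT_SLOW_DOWN; // feed AIMDs early
            return Decision.ACCEPT;
        }
    }
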
> 
> Ian.