On Tuesday 30 Aug 2011 18:32:17 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 6:42 PM, Matthew Toseland <toad at amphibian.dyndns.org> wrote:
> > On Monday 29 Aug 2011 18:58:26 Ian Clarke wrote:
> > > On Mon, Aug 29, 2011 at 12:37 PM, Matthew Toseland <
> > > Right, the same is true of queueing. If nodes are forced to do things to
> > > deal with overloading that make the problem worse then the load balancing
> > > algorithm has failed. Its job is to prevent that from happening.
> >
> > Not true. Queueing does not make anything worse (for bulk requests where we
> > are not latency sensitive). **When a request is waiting for progress on a
> > queue, it is not using any bandwidth!**
>
> I thought there was some issue where outstanding requests occupied "slots"
> or something?
Slots are arbitrary, a tool for estimating usage in order to avoid timeouts; the real resource is bandwidth. If we end up using less than the limit because we are waiting for slots to free, the simplest solution is simply to add more slots (and increase the timeouts if necessary).

> Regardless, even if queueing doesn't use additional bandwidth or CPU
> resources, it also doesn't use any less of these resources - so it doesn't
> actually help to alleviate any load (unless it results in a timeout in which
> case it uses more of everything).

Agreed that timeouts are bad. The basic function of queueing is to match incoming requests better to outgoing requests.

> And it does use more of one very important resource, which is the initial
> requestor's time. I mean, ultimately the symptom of overloading is that
> requests take longer, and queueing makes that problem worse.

For realtime requests, latency matters. For bulk requests (e.g. big downloads), it doesn't; throughput matters. The time it takes to transfer 100MB of data is determined by the time it takes to start each request, the number that can run in parallel, and the time each one takes to run. You are saying that only the last variable matters, when in fact it is largely irrelevant provided that the others change too. More important still is routing accuracy, however: if we go more hops we waste more resources, as ArneBab has pointed out.

> Queueing should be a last resort, the *right* load balancing algorithm
> should avoid situations where queueing must occur.

Substitute "misrouting" for "queueing". Queueing is neutral: it is an annoyance at worst, and if it enables greater efficiency at the same or greater throughput, it's worth it. Misrouting, by contrast, seriously aggravates the underlying problem (bandwidth usage, *the* scarce resource here) by transferring the same data over more hops.
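To make the throughput point concrete, here is a small sketch (Little's law: throughput = concurrency / time-per-request) with purely illustrative numbers; the 18s/60s figures come from the NLM discussion below, while the parallelism counts and 32KiB block size are assumptions for the example, not measured values.

```python
BLOCK_KB = 32  # CHK blocks are 32KiB; used here only for illustration


def throughput_kbps(parallel_requests: int, seconds_per_request: float) -> float:
    """KB/s delivered when `parallel_requests` run concurrently,
    each taking `seconds_per_request` to complete (Little's law)."""
    return parallel_requests * BLOCK_KB / seconds_per_request


# Fast requests, limited parallelism: ~18s each, 10 in flight.
old = throughput_kbps(parallel_requests=10, seconds_per_request=18)

# Queueing makes each request slower (~60s), but if roughly 3.3x as many
# requests can be in flight, bulk throughput is essentially unchanged.
new = throughput_kbps(parallel_requests=33, seconds_per_request=60)

print(f"old: {old:.1f} KB/s, new: {new:.1f} KB/s")
```

The point the numbers make: per-request latency only hurts bulk transfers if the number of requests in flight cannot rise to compensate.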
Except for those requests where we are explicitly targeting low latency - realtime fproxy requests - which can be treated differently, and are usually for very popular keys anyway.

The point is, NLM achieves good routing accuracy, in practice as well as in theory. Throughput is greatly reduced, mainly because: 1) there is insufficient feedback to the request originators, and 2) requests take longer (with AIMDs enabled, a bit over 1 minute vs more like 18 seconds). We can solve both problems relatively easily:

- Increase the number of allowed requests by increasing the bandwidth limiter's time-to-transfer-everything-in-worst-case parameter from 120 to, say, 240. Since we have some spare bandwidth to play with when NLM is enabled, this will increase throughput without significantly increasing transfer times. It won't directly increase queueing times (because twice the number of requests are chasing twice the number of slots), at least not until we reach the point where it affects transfer times.
- Keep AIMDs, so that the originators slow down when there is a problem.
- Make AIMDs more effective by sending slow-down messages when we are approaching our capacity, not only when we are past it and rejecting requests, i.e. misrouting.
- Keep fair sharing and improve it. I have shown in my other mail that this is necessary regardless.
- Optionally, separate load management for SSKs from that for CHKs. Queueing times, according to ArneBab's maths, are dependent on the transfer time, which is much, much higher for CHKs than for SSKs. Therefore SSKs, with their very low success rates due to messaging apps, are having to wait much longer because of the CHKs (which mostly succeed).

> Ian.

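The AIMD points above can be sketched as a request-window controller. This is a minimal illustration, not Freenet's actual implementation: the class name, constants, and the `on_slow_down` hook are all assumptions made for the example. The key proposed change is that a peer's advisory slow-down message triggers a gentle backoff *before* overload forces a rejection (and hence misrouting).

```python
class AIMDWindow:
    """Illustrative AIMD controller for the number of requests in flight."""

    def __init__(self, increase=1.0, decrease=0.5, soft_decrease=0.9,
                 min_window=1.0):
        self.window = min_window            # max requests in flight
        self.increase = increase            # additive increase on success
        self.decrease = decrease            # multiplicative decrease on reject
        self.soft_decrease = soft_decrease  # gentler decrease on slow-down
        self.min_window = min_window

    def on_success(self):
        # Additive increase: grow by ~1 per window's worth of successes.
        self.window += self.increase / self.window

    def on_slow_down(self):
        # Proposed: peers signal when *approaching* capacity, so back off
        # gently and early, before they have to reject and misroute.
        self.window = max(self.min_window, self.window * self.soft_decrease)

    def on_rejected(self):
        # Classic multiplicative decrease once overload is actually hit.
        self.window = max(self.min_window, self.window * self.decrease)
```

The design intent is the usual one from TCP-style congestion control: frequent small, early signals keep originators oscillating near capacity, instead of overshooting until requests are rejected outright.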