Matthew,

This makes sense from the perspective of making incremental changes, but I
think we need to be more drastic than that.  We need to go back to the
drawing board with load management and find a solution that is simple
enough to reason about, and to debug if we have problems with it.

The entire approach of forming hypotheses about what is wrong, building a
solution based on those hypotheses (without actually confirming that they
are accurate), and deploying it is deja vu: we've been doing it for a
decade, and we still haven't got load management right.  We're just
layering more complexity onto a system we already don't understand, based
on guesses about what was wrong with the previous iteration, guesses we
can't test because the system is too complicated, with too many
interactions, for anyone to get their head around it.

If something isn't working, and we don't understand for certain what is
wrong with it, then we shouldn't build more on top of it in the hope that
we'll accidentally solve the problem; we should replace it with something
simple enough that we do understand it, right?

The purpose of load management is relatively simple: *Don't allow clients to
pump more requests into the network than the network can handle, while
ensuring that this workload is distributed across the network in a
reasonably efficient manner.  This must be done in a decentralized way that
is not vulnerable to abuse.*
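To make the first half of that goal concrete (don't inject more than the network can handle), one of the simplest conceivable mechanisms is for each node to gate the requests it originates with a token bucket.  This is only an illustrative sketch, not Freenet code; the class and parameter names are mine:

```python
import time

class TokenBucket:
    """Cap the long-run rate at which a node injects requests.

    Tokens refill continuously at `rate_per_sec`; a request is admitted
    only when a whole token is available, so bursts are bounded by
    `burst` and the sustained rate by `rate_per_sec`.
    """

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_admit(self):
        # Refill tokens for the time elapsed since the last check.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A request that fails `try_admit` would be held back or rejected locally rather than pushed into the network; the hard part, of course, is deciding what the rate should be, which is what the rest of this email is about.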
The current load management system includes many interacting components,
which makes it nearly impossible to understand or debug.  We need to go
back to the drawing board, starting from the goal stated above.

I would invite people to suggest the simplest possible load management
schemes that might work.  We can then discuss them and figure out which is
most likely to work and, if it doesn't, which will be easiest to debug.

We can bear in mind a few lessons we've learned, though.  What
characteristics should our load management system have?

   - Dropping requests should be extremely rare because this
   just exacerbates overloading
   - Delaying requests should also be extremely rare for the same reason
   - Misrouting requests should be limited, but perhaps acceptable
   occasionally, for the same reason

It therefore really comes down to nodes anticipating whether the network
is in danger of having to drop, delay, or misroute requests, and reducing
the rate at which they pump requests into the network when that danger
point is approached.
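That "back off when the danger point is approached" behaviour is essentially the additive-increase/multiplicative-decrease (AIMD) scheme TCP uses for congestion control.  A minimal sketch of the idea applied to a node's own injection rate, with illustrative (not proposed) names and numbers:

```python
class AIMDRateController:
    """Grow the allowed injection rate slowly while the network looks
    healthy; cut it sharply on any drop/delay/misroute danger signal."""

    def __init__(self, initial_rate=10.0, increase=1.0,
                 decrease_factor=0.5, floor=1.0):
        self.rate = initial_rate                # requests/sec we allow ourselves
        self.increase = increase                # additive step per healthy interval
        self.decrease_factor = decrease_factor  # multiplicative cut on danger
        self.floor = floor                      # never throttle all the way to zero

    def on_healthy_interval(self):
        self.rate += self.increase

    def on_danger_signal(self):
        self.rate = max(self.floor, self.rate * self.decrease_factor)
```

The asymmetry (slow increase, fast decrease) is what lets many independent senders back off quickly under congestion and then probe their way back up, rather than all oscillating in lockstep.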

So how do nodes inform each other that they are at risk of having to drop,
delay, or misroute requests?  There are two reasons this might happen:
either a node is itself close to its own capacity for relaying requests,
or the other nodes it relays to are at or close to theirs.

One problem I'm concerned about when nodes share information about how
overloaded their peers are is a "gridlock":

What if every node told its peers that it was overloaded because all of
its own peers were overloaded?  Such a situation would basically cause the
entire network to think it was overloaded, even though nobody actually
was!  It becomes a bit like this cartoon: http://flic.kr/p/5npfm2  How can
this be avoided?
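The gridlock risk is easy to demonstrate with a toy model.  Suppose a node reports itself overloaded whenever it is locally overloaded *or* any peer reported overloaded in the previous round; then an all-overloaded belief sustains itself forever even when nobody is actually overloaded.  Everything below is a hypothetical model of that rule, not Freenet code:

```python
def step(reporting, neighbours, locally_overloaded=frozenset()):
    """One gossip round under the naive rule: a node reports overloaded
    if it is locally overloaded or any neighbour reported overloaded
    in the previous round."""
    return {n for n in neighbours
            if n in locally_overloaded
            or any(p in reporting for p in neighbours[n])}

# Three fully connected nodes, none actually overloaded, but all
# currently *reporting* overloaded (say, after a transient spike):
ring = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
state = {"a", "b", "c"}
for _ in range(100):
    state = step(state, ring)
# state is still {"a", "b", "c"}: the mistaken belief never dies out,
# because every node always sees some neighbour still reporting it.
```

One obvious escape is to make the reported signal reflect only a node's *local* load (queue length, bandwidth, CPU), so that peer reports can influence routing decisions but are never echoed back as a node's own overload claim.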

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org