[freenet-dev] Load management theory

2011-07-12 Thread Matthew Toseland
On Saturday 09 Jul 2011 21:22:04 Evan Daniel wrote:
> I've only looked at the new load management emails somewhat, not in
> detail. However, it seems to me that there hasn't been a deep enough
> theoretical analysis of the issue. I'll try to lay out what I mean by
> that here, and start the process. I don't have time to finish it
> anytime soon, but hopefully this will be useful anyway.

That seems entirely reasonable. In fact it's one of my main worries about NLM!
> 
> I see four major factors in load management, plus a couple others I'll
> mention after this that make things more complicated. Those are
> capacity of neighbors, location distribution of neighbors, incoming
> request distribution, and request misrouting tolerance.

This is reasonable.
> 
> Obviously not all neighbors have the same request-handling capacity.
> At present, Freenet acknowledges this to a limited degree via the
> bandwidth limit / peer limit relationship. It is possible we do not
> wish to allow an infinite range of capacities, because of concern
> about the effects of (non-malicious) supernodes on the network, as
> regards things like security and fault tolerance.

Right. Currently we limit the proportion of requests a peer can have. This 
results in misrouting, even with NLM. I wonder if we need to specify it more 
explicitly in NLM.
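
Roughly the kind of cap I mean, as a sketch (the names and the 25% figure are 
made up for illustration, not the actual fred code):

    // If routing one more request to this peer would push it above a fixed
    // share of our in-flight requests, we route to the next-best peer
    // instead - which is exactly where the misrouting comes from.
    final class PeerShareCapSketch {
        private static final double MAX_SHARE = 0.25; // illustrative only

        static boolean canRouteTo(int requestsAlreadyToPeer, int totalInFlight) {
            double shareAfter = (requestsAlreadyToPeer + 1.0) / (totalInFlight + 1.0);
            return shareAfter <= MAX_SHARE;
        }
    }
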
> 
> Neighbor location distribution and incoming request distribution are
> closely connected. If the two are matched, then all neighbors will be
> sent a similar number of requests. If they're mismatched, some
> neighbors will get a larger share of requests than others. Variable
> peer capacity means that a mismatch here may be desirable, in order to
> produce an outbound request pattern that matches available network
> capacity.

Possibly ... on the other hand, opennet should ensure a reasonable match. In 
fact we have people worrying about it being too reasonable: nodes with lots of 
local traffic have lots of distant peers, nodes with little local traffic have 
too few distant peers, and there have been proposals for how to deal with this.

On darknet, swapping ensures peers are distributed reasonably for remote 
requests; incoming requests should generally be highly specialised as long as 
we don't have too much local traffic. HOWEVER swapping does not care about 
variable capacities, so what you say above does apply in this case...
> 
> Lastly, we have misrouting tolerance. At one extreme, we route every
> request correctly, and the capacity of the local node is limited by
> the slowest neighbor node. (To be more precise: the local capacity C_l
> becomes min({r_0, r_1, ..., r_n}) where r_i is the ratio C_i/S_i, and
> C_i is the capacity of neighbor i, and S_i is the share of incoming
> requests sent to neighbor i. In other words, the local capacity may be
> limited not by the slowest node, but by some faster node that gets
> sent a disproportionately larger share of the request traffic.)

Vive did some work suggesting that misrouting isn't always fatal, or at least 
that it is possible to usefully take performance into account. However, 
experience with 0.5 shows that misrouting is generally a very serious problem. 
0.5 is the extreme case - we essentially routed by load, specialisation was 
very weak, and as a consequence data persistence and performance were poor. In 
particular, if misrouting results in not finding data, we go more hops and are 
likely to retry, even though a failed request doesn't use much bandwidth; and 
if it results in storing data in the wrong place on insert, it is even worse, 
since a lot more effort is then spent finding the data.

NLM takes a fairly strict view of misrouting:
1. We pick up to 4 nodes. These are the top nodes for the request. The first 
always happens; the second happens if the ideal next node is low-capacity (less 
than min[median/2, first quartile] of our peers' capacities); the third happens 
if the request is realtime (which is irrelevant, as NLM will be turned off for 
realtime in the near future); the fourth happens if we can't find a route after 
1 minute.
2. We can route to any of these.
3. After we are rejected we reroute, and we can add another node as above.
4. However, if we get too many RejectedLoops, we stop adding more. We *want* 
to time out (that is, to fail to get a slot, to fail to route the request 
forward within 2 minutes), so that we can terminate the request and tell the 
originator not to retry for a while because this part of the keyspace is 
difficult; this is the only mechanism we have for dealing with high-capacity / 
low-capacity boundaries at present. (A rough sketch of this candidate 
selection in code follows.)
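
Roughly, in simplified pseudo-Java (the peer representation, thresholds and 
names are illustrative, not the real implementation):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    final class NlmCandidateSketch {
        static final long ONE_MINUTE_MS = 60 * 1000L;

        // peersByCloseness: peer indices sorted by closeness to the key.
        // capacities: our estimate of each peer's capacity.
        static List<Integer> pickCandidates(List<Integer> peersByCloseness,
                                            double[] capacities,
                                            double idealCapacity,
                                            boolean realtime,
                                            long msLookingForRoute) {
            List<Integer> candidates = new ArrayList<Integer>();
            int next = 0;
            candidates.add(peersByCloseness.get(next++));     // 1: always
            double[] sorted = capacities.clone();
            Arrays.sort(sorted);
            double median = sorted[sorted.length / 2];
            double firstQuartile = sorted[sorted.length / 4];
            if (idealCapacity < Math.min(median / 2, firstQuartile)
                    && next < peersByCloseness.size())
                candidates.add(peersByCloseness.get(next++)); // 2: low-capacity ideal node
            if (realtime && next < peersByCloseness.size())
                candidates.add(peersByCloseness.get(next++)); // 3: realtime request
            if (msLookingForRoute > ONE_MINUTE_MS
                    && next < peersByCloseness.size())
                candidates.add(peersByCloseness.get(next));   // 4: no route after a minute
            return candidates;
        }
    }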

This is of course fairly arbitrary... Ideally we'd like to have some 
theoretical basis for how much misrouting is a good thing, or even calculate it 
in real time based on some statistic. But *any* amount of misrouting 
potentially increases the total network load, so we need to be very careful 
here. IMHO such arbitrary limits are probably a reasonable first approximation, 
especially 

[freenet-dev] Load management theory

2011-07-09 Thread Evan Daniel
I've only looked at the new load management emails somewhat, not in
detail. However, it seems to me that there hasn't been a deep enough
theoretical analysis of the issue. I'll try to lay out what I mean by
that here, and start the process. I don't have time to finish it
anytime soon, but hopefully this will be useful anyway.

I see four major factors in load management, plus a couple others I'll
mention after this that make things more complicated. Those are
capacity of neighbors, location distribution of neighbors, incoming
request distribution, and request misrouting tolerance.

Obviously not all neighbors have the same request-handling capacity.
At present, Freenet acknowledges this to a limited degree via the
bandwidth limit / peer limit relationship. It is possible we do not
wish to allow an infinite range of capacities, because of concern
about the effects of (non-malicious) supernodes on the network, as
regards things like security and fault tolerance.

Neighbor location distribution and incoming request distribution are
closely connected. If the two are matched, then all neighbors will be
sent a similar number of requests. If they're mismatched, some
neighbors will get a larger share of requests than others. Variable
peer capacity means that a mismatch here may be desirable, in order to
produce an outbound request pattern that matches available network
capacity.
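
To make the "matched" case concrete, here's a rough sketch (my own simplified 
model, assuming remote request keys are uniform on the location circle and 
each request goes to the closest peer): a peer's share of outgoing requests is 
just the length of the arc of keyspace it is closest to.

    import java.util.Arrays;

    final class RequestShareSketch {
        // Returns each peer's share of uniformly distributed request keys,
        // in sorted-location order. Half of each circular gap between
        // adjacent peer locations belongs to each of the two peers.
        static double[] requestShares(double[] peerLocations) {
            int n = peerLocations.length;
            if (n == 1) return new double[] { 1.0 };
            double[] locs = peerLocations.clone();
            Arrays.sort(locs);
            double[] shares = new double[n];
            for (int i = 0; i < n; i++) {
                double prev = locs[(i - 1 + n) % n];
                double next = locs[(i + 1) % n];
                double gapPrev = (locs[i] - prev + 1.0) % 1.0;
                double gapNext = (next - locs[i] + 1.0) % 1.0;
                shares[i] = (gapPrev + gapNext) / 2.0;
            }
            return shares;
        }
    }

If the incoming key distribution isn't uniform (lots of local traffic, say), 
the same peer locations produce a skewed share - that's the mismatch I mean.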

Lastly, we have misrouting tolerance. At one extreme, we route every
request correctly, and the capacity of the local node is limited by
the slowest neighbor node. (To be more precise: the local capacity C_l
becomes min({r_0, r_1, ..., r_n}) where r_i is the ratio C_i/S_i, and
C_i is the capacity of neighbor i, and S_i is the share of incoming
requests sent to neighbor i. In other words, the local capacity may be
limited not by the slowest node, but by some faster node that gets
sent a disproportionately larger share of the request traffic.)
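
As a quick worked example of that formula (illustrative numbers only):

    final class LocalCapacitySketch {
        // C_l = min over neighbours of C_i / S_i.
        static double localCapacity(double[] capacity, double[] share) {
            double cl = Double.POSITIVE_INFINITY;
            for (int i = 0; i < capacity.length; i++)
                cl = Math.min(cl, capacity[i] / share[i]);
            return cl;
        }
        // e.g. capacities {10, 40} requests/min with shares {0.1, 0.9}:
        // the ratios are 100 and ~44, so the faster neighbour is the
        // bottleneck and local capacity is about 44 requests/min.
    }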

These four interrelate, at a fundamental theoretical level. Namely,
you can't arbitrarily control all four at the same time. To use the
term loosely, there are four parameters, but only three degrees of
freedom. So, the question becomes, where do we want to exercise
control, and how can we do so most optimally?

This analysis seems to me to lead to a conclusion that we should be
trying to do load management by controlling who our peers are. We
can't do much about incoming request distribution (not entirely true
when you include local requests, more on this later) or peer capacity.
That leaves one degree of freedom between misrouting and peer location
distribution, so we find we're trading between those two.

This is, imho, not a surprising conclusion, but also not one I've seen
explicitly stated before. It's somewhat explicit in our current LRU
peer selection scheme, but not to the degree I think we should be
talking about.

There are a couple factors that complicate all of this. These include
locally originated requests, locally answered requests, and FOAF
routing. Locally answered requests aren't often discussed in detail in
simulations. For example, the simulators I've written all try to route
to the best node on the network (shortest node location to request
location distance). However, node store size variability makes things
more complicated than that. I'm also not sure how to handle it in a
nice clean fashion. We could try simulating it, but that's just going
to be an exercise in GIGO unless we have actual data to use. (Probe
requests as per issue 3568 would be good here.) However, I think we
can get rather far while continuing to ignore that issue.

Locally originated requests are trickier. Some nodes have lots, some
have very few, some vary when people browse freesites, etc. I'm not
really sure what to do about them. They need more careful analysis.
Specifically, I see two conflicting requirements: local requests need
to be taken into account when choosing peers, so that our local
requests don't swamp the long links entirely; and local requests need
to not be taken into account, so that they don't distort the overall
network topology.

FOAF routing, combined with our bandwidth-peer count relationship, is
a start at handling load by managing who our peers are. I think this
is the direction further load management work should take.

I'll end with a few ideas I haven't explored in much detail yet. They
might be good ones or bad ones, but I hope they're at least
interesting.

What if each peer advertises its available outbound bandwidth to its
peers? The peer selection code could then attempt to maintain a
constant ratio of bandwidth per fraction of requests served among
peers.
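
As a rough sketch of what I mean (all names are illustrative):

    final class BandwidthShareSketch {
        // Ratio of advertised outbound bandwidth to the fraction of our
        // requests each peer serves. Peer selection could then try to keep
        // these ratios roughly equal, e.g. by preferring to replace the
        // peer whose ratio deviates most from the average.
        static double[] bandwidthPerShare(double[] advertisedBandwidth,
                                          double[] requestShare) {
            double[] ratio = new double[advertisedBandwidth.length];
            for (int i = 0; i < ratio.length; i++)
                ratio[i] = requestShare[i] > 0
                        ? advertisedBandwidth[i] / requestShare[i]
                        : Double.POSITIVE_INFINITY; // peer currently unused
            return ratio;
        }
    }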

What about modifying the LRU code to correct the time since last usage
by a factor of the number of FOAF locations? That is, if we have peer A
with 8 FOAF locations advertised and peer B with 20, and B is
responding to 1.5 requests for every 1 that A responds to, then A
should be treated as the better node. That is, we favor
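
In code, using the numbers from that example (a rough sketch of the 
normalisation I have in mind):

    final class FoafLruSketch {
        // Requests answered per advertised FOAF location: a peer covering
        // more keyspace is expected to answer proportionally more requests.
        static double normalisedRate(double requestsAnswered, int foafLocations) {
            return requestsAnswered / foafLocations;
        }
        // Peer A: 1.0 / 8  = 0.125 requests per advertised location
        // Peer B: 1.5 / 20 = 0.075 requests per advertised location
        // so A comes out ahead despite answering fewer requests overall.
    }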
