[freenet-dev] Load management theory
On Saturday 09 Jul 2011 21:22:04 Evan Daniel wrote:
> I've only looked at the new load management emails somewhat, not in
> detail. However, it seems to me that there hasn't been a deep enough
> theoretical analysis of the issue. I'll try to lay out what I mean by
> that here, and start the process. I don't have time to finish it
> anytime soon, but hopefully this will be useful anyway.

That seems entirely reasonable. In fact it's one of my main worries about NLM!

> I see four major factors in load management, plus a couple others I'll
> mention after this that make things more complicated. Those are
> capacity of neighbors, location distribution of neighbors, incoming
> request distribution, and request misrouting tolerance.

This is reasonable.

> Obviously not all neighbors have the same request-handling capacity.
> At present, Freenet acknowledges this to a limited degree via the
> bandwidth limit / peer limit relationship. It is possible we do not
> wish to allow an infinite range of capacities, because of concern
> about the effects of (non-malicious) supernodes on the network, as
> regards things like security and fault tolerance.

Right. Currently we limit the proportion of requests a peer can have. This results in misrouting, even with NLM. I wonder if we need to specify it more explicitly in NLM.

> Neighbor location distribution and incoming request distribution are
> closely connected. If the two are matched, then all neighbors will be
> sent a similar number of requests. If they're mismatched, some
> neighbors will get a larger share of requests than others. Variable
> peer capacity means that a mismatch here may be desirable, in order to
> produce an outbound request pattern that matches available network
> capacity.

Possibly ... on the other hand, opennet should ensure a reasonable match. In fact we have people worrying about it being too reasonable: nodes with lots of local traffic have lots of distant peers, nodes with little local traffic have too few distant peers, and there have been proposals for how to deal with this. On darknet, swapping ensures peers are distributed reasonably for remote requests; incoming requests should generally be highly specialised as long as we don't have too much local traffic. HOWEVER, swapping does not care about variable capacities, so what you say above does apply in this case...

> Lastly, we have misrouting tolerance. At one extreme, we route every
> request correctly, and the capacity of the local node is limited by
> the slowest neighbor node. (To be more precise: the local capacity C_l
> becomes min({r_0, r_1, ..., r_n}) where r_i is the ratio C_i/S_i, and
> C_i is the capacity of neighbor i, and S_i is the share of incoming
> requests sent to neighbor i. In other words, the local capacity may be
> limited not by the slowest node, but by some faster node that gets
> sent a disproportionately larger share of the request traffic.)

Vive did some work suggesting that misrouting isn't always fatal, or at least that it is possible to usefully take performance into account. However, experience in 0.5 shows that misrouting is generally a very serious problem. The extreme case is 0.5: we essentially routed load, specialisation was very weak, and as a consequence data persistence and performance were poor. In particular, if misrouting results in not finding data, we go more hops and we are likely to retry, even if a failed request doesn't use much bandwidth; and if it results in storing data in the wrong place on insert, it is even worse, resulting in a lot more effort being spent to find it.

NLM takes a fairly strict view of misrouting:

1. We pick up to 4 nodes. These are the top nodes for the request. The first always happens; the second happens if the ideal next node is low-capacity (less than min[median/2, first quartile] of our peers' capacities); the third happens if the request is realtime (which is irrelevant, as NLM will be turned off for realtime in the near future); the fourth happens if we can't find a route after 1 minute.
2. We can route to any of these.
3. After we are rejected we reroute, and we can add another node as above.
4. However, if we get too many RejectedLoops, we stop adding more.

We *want* to timeout (that is, to fail to get a slot, to fail to route the request forward in 2 minutes), so that we can terminate the request and tell the originator not to retry for a while because this part of the keyspace is difficult; this is the only mechanism we have for dealing with high capacity / low capacity boundaries at present.

This is of course fairly arbitrary... Ideally we'd like to have some theoretical basis for how much misrouting is a good thing, or even to calculate it in real time based on some statistic. But *any* amount of misrouting potentially increases the total network load, so we need to be very careful here. IMHO such arbitrary limits are probably a reasonable first approximation, especially
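The candidate-selection rules above can be sketched roughly as follows. This is an illustrative reading, not the actual NLM code: the function names and arguments are invented, and the "first quartile" convention is a guess (the real implementation may compute it differently).

```python
import statistics

def low_capacity_threshold(capacities):
    """min(median/2, first quartile) of our peers' capacities: the
    threshold below which the ideal next node counts as low-capacity.
    (First quartile here = median of the lower half; a common
    convention, but an assumption about the real code.)"""
    caps = sorted(capacities)
    median = statistics.median(caps)
    lower_half = caps[: len(caps) // 2] or caps[:1]
    first_quartile = statistics.median(lower_half)
    return min(median / 2, first_quartile)

def pick_candidates(peers_by_closeness, capacity, realtime=False,
                    seconds_searching=0):
    """Pick up to 4 candidate nodes for a request, per the rules above:
    the 1st always; the 2nd if the ideal node is low-capacity; the 3rd
    if the request is realtime; the 4th if no route after 1 minute."""
    threshold = low_capacity_threshold(capacity.values())
    n = 1                                      # first candidate: always
    if capacity[peers_by_closeness[0]] < threshold:
        n += 1                                 # ideal node is low-capacity
    if realtime:
        n += 1                                 # realtime request
    if seconds_searching >= 60:
        n += 1                                 # no route found after 1 minute
    return peers_by_closeness[:min(n, 4)]
```

The point of the min[median/2, first quartile] threshold is that "low-capacity" is judged relative to this node's own peer set, so a node whose peers are all slow doesn't misroute constantly.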
[freenet-dev] Load management theory
I've only looked at the new load management emails somewhat, not in detail. However, it seems to me that there hasn't been a deep enough theoretical analysis of the issue. I'll try to lay out what I mean by that here, and start the process. I don't have time to finish it anytime soon, but hopefully this will be useful anyway.

I see four major factors in load management, plus a couple others I'll mention after this that make things more complicated. Those are capacity of neighbors, location distribution of neighbors, incoming request distribution, and request misrouting tolerance.

Obviously not all neighbors have the same request-handling capacity. At present, Freenet acknowledges this to a limited degree via the bandwidth limit / peer limit relationship. It is possible we do not wish to allow an infinite range of capacities, because of concern about the effects of (non-malicious) supernodes on the network, as regards things like security and fault tolerance.

Neighbor location distribution and incoming request distribution are closely connected. If the two are matched, then all neighbors will be sent a similar number of requests. If they're mismatched, some neighbors will get a larger share of requests than others. Variable peer capacity means that a mismatch here may be desirable, in order to produce an outbound request pattern that matches available network capacity.

Lastly, we have misrouting tolerance. At one extreme, we route every request correctly, and the capacity of the local node is limited by the slowest neighbor node. (To be more precise: the local capacity C_l becomes min({r_0, r_1, ..., r_n}) where r_i is the ratio C_i/S_i, C_i is the capacity of neighbor i, and S_i is the share of incoming requests sent to neighbor i. In other words, the local capacity may be limited not by the slowest node, but by some faster node that gets sent a disproportionately larger share of the request traffic.)

These four interrelate, at a fundamental theoretical level.
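The min-ratio bound above can be made concrete with a small numeric sketch (the capacities and shares are invented for illustration):

```python
def effective_capacity(capacities, shares):
    """Local capacity under perfect routing: C_l = min_i(C_i / S_i),
    where C_i is neighbor i's capacity and S_i is the share of
    incoming requests we send to neighbor i."""
    return min(c / s for c, s in zip(capacities, shares))

# Three neighbors: one fast node receiving most of the traffic, one
# slow node receiving very little. (Numbers invented for illustration.)
caps   = [100.0, 10.0, 50.0]   # requests/min each neighbor can handle
shares = [0.7, 0.05, 0.25]     # fraction of our outgoing requests each gets

# The binding constraint is the fast neighbor (100/0.7, about 143),
# not the slowest one (10/0.05 = 200): a faster node with a
# disproportionate share of the traffic can be the limit.
```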
Namely, you can't arbitrarily control all four at the same time. To use the term loosely, there are four parameters, but only three degrees of freedom. So the question becomes: where do we want to exercise control, and how can we do so most optimally?

This analysis seems to me to lead to the conclusion that we should be trying to do load management by controlling who our peers are. We can't do much about incoming request distribution (not entirely true when you include local requests; more on this later) or peer capacity. That leaves one degree of freedom between misrouting and peer location distribution, so we find we're trading between those two. This is, imho, not a surprising conclusion, but also not one I've seen explicitly stated before. It's somewhat explicit in our current LRU peer selection scheme, but not to the degree I think we should be talking about.

There are a couple of factors that complicate all of this: locally originated requests, locally answered requests, and FOAF routing.

Locally answered requests aren't often discussed in detail in simulations. For example, the simulators I've written all try to route to the best node on the network (shortest node location to request location distance). However, node store size variability makes things more complicated than that. I'm also not sure how to handle it in a nice clean fashion. We could try simulating it, but that's just going to be an exercise in GIGO unless we have actual data to use. (Probe requests as per issue 3568 would be good here.) However, I think we can get rather far while continuing to ignore that issue.

Locally originated requests are trickier. Some nodes have lots, some have very few, some vary when people browse freesites, etc. I'm not really sure what to do about them. They need more careful analysis.
Specifically, I see two conflicting requirements: local requests need to be taken into account when choosing peers, so that our local requests don't swamp the long links entirely; and local requests need to not be taken into account, so that they don't distort the overall network topology.

FOAF routing, combined with our bandwidth-peer count relationship, is a start at handling load by managing who our peers are. I think this is the direction further load management work should take.

I'll end with a few ideas I haven't explored in much detail yet. They might be good ones or bad ones, but I hope they're at least interesting.

What if each peer advertises its available outbound bandwidth to its peers? The peer selection code could then attempt to maintain a constant ratio of bandwidth per fraction of requests served among peers.

What about modifying the LRU code to correct the time since last usage by a factor of the number of FOAF locations? That is, if we have peer A with 8 FOAF locations advertised and peer B with 20, and B is responding to 1.5 requests for every 1 that A responds to, then A should be treated as the better node. That is, we favor
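One possible reading of the FOAF-corrected idea above, as a sketch. The scoring rule here is a guess at what "correct by a factor of the number of FOAF locations" might mean, not a worked-out proposal:

```python
def per_location_score(requests_answered, foaf_locations):
    """Requests answered per advertised FOAF location: a peer that
    answers fewer requests in absolute terms may still be pulling more
    weight relative to the keyspace it advertises."""
    return requests_answered / foaf_locations

# The example from the text: peer A advertises 8 FOAF locations,
# peer B advertises 20, and B answers 1.5 requests for each 1 of A's.
score_a = per_location_score(1.0, 8)    # A's answers per location
score_b = per_location_score(1.5, 20)   # B's answers per location
# A scores higher per advertised location, so under this reading A
# would be treated as the better node, as the text suggests.
```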