----- Anonymous at o9_0DTuZniSf_+oDmRsonByWxsI ----- 2007.06.17 - 13:53:13GMT -----
Sounds like the proposal should be expressed in a more formal way, to ensure that we are not all talking about different things. Maybe comparing it with the current implementation is a good starting point, so the proposed differences are:

- Instead of using a single system UDP socket, each connection should be given a distinct UDP socket bound to a specific remote address. The UDP socket not bound to a remote is still used to notice new connections (and connections where the peer changed its IP address); once a newly seen address is authenticated, a new bound UDP socket is created to handle the connection. Potentially, a specific peer might be sending UDP packets from different IP addresses, where the address changes not on a daily/hourly basis (which is quite common) but where a remote NAT assigns a random address from a pool of allotted addresses for [virtually] every packet. I haven't seen such badly broken internet connections yet; most likely such a node would prefer TCP or HTTP over any UDP-based transport anyway. The unbound UDP socket is given a low weight as far as the bandwidth sharing algorithm goes.

- Aside from the unbound UDP socket, the per-connection UDP sockets are given equal bandwidth sharing weight (with unused bandwidth somehow, not necessarily evenly, distributed among peers that could handle more traffic). Each peer/socket is handled by a separate thread, so different peers never block each other on network I/O. In other words, every peer is guaranteed at least a 1/(P+1) share of incoming bandwidth - never long delays with no reads, which might lead to oscillations (which, as mroger pointed out, could be fought with very inertial averages, but there is no need to provoke them either). But as all the momentarily unused bandwidth is made available to other peers/sockets, the token bucket is very small, and most likely should never rise above zero (using the following semantics: upon request processing the bucket size becomes negative, and the next read from the socket may occur only when the bucket has been refilled back to the zero level, at which point the continuous background refilling stops). A sketch of such a read loop is given after the routing proposal below. Using NIO the extra threads could be avoided, but that is not really needed, seeing that a node is discouraged from having more than a dozen connected peers anyway. The threads will of course be synchronizing on local token/bucket handling - which is expected to take well under a microsecond - and on tasks like db access - which is very good, as it keeps the node from spiralling down due to CPU/disk performance starvation. A simple implementation would also limit each peer to at most one simultaneous db access - and that is good too, as it throttles peers that abuse our disk resources (by flooding requests with htl=1, for example).

- The current I/O bandwidth liability limiting mechanism is not deprecated at all; as toad pointed out multiple times, it is essential for handling tiny requests, which often enough result in large responses. Once a packet is read from a bound UDP socket, it is one of:

  a) An insert or request: it is verified against the bandwidth liability limiters; if the checks pass, the insert/request is accepted as it currently is. Otherwise the insert/request is explicitly rejected (FNPRejectOverload) as it currently is. Nothing really changes here. At the moment of accepting the insert/request, the peer's token bucket might be charged, in order to discourage floods; the charged tokens are later credited to the peer we receive the requested data from.

  b) Data we are waiting for, either for our own node or for a request we forwarded.
  c) Auxiliary traffic like acks, location swaps, or keepalives. These undergo no liability limiting, as the transfer has already happened - exactly the same as now.

Notice that if the received traffic is 'valuable' (explicitly requested earlier), we credit the peer with the tokens charged from the original requestor's bucket (the only mechanism that can possibly raise a bucket size above zero), so this peer is not penalized (at the very least not as heavily as now) - and so this peer's requests do not get postponed for long. As the bonus tokens are not just thrown in, but taken from other peers who are interested in service, there should be no bursts happening. Charging the buckets might also be done at the moment the data response is sent rather than at the moment of the request (or the charge can somehow be split between those two points in time).

Now, a question should be raised: if all the liability limiter stuff is expected to work as it already does, why bother? The trick is that delayed socket reads allow the remote side to notice in advance that the peer has become slower. The remote will not know whether that happened due to exceeding the 1/(P+1) share of bandwidth, or due to a CPU/disk/swap/whatever performance shortage: all these situations require slowing down to avoid rejects and/or timeouts. So the slowing down happens at two levels: first, the delayed socket reads smooth the load over a longer time period, and second, the requestor will start sending/queueing fewer requests here, making more use of alternative paths and less busy peers. And that's the whole point. Additional note: this algorithm requires no modification if UDP sockets are mixed with TCP ones (provided that the TCP window size is comparable to the MTU, not the default 32/64/128KB).

Until last week I was under the impression that a timing-based mechanism for routing around overloaded nodes was already implemented. After taking the time to study the code closer it turns out to be false: AFAIS the routing around happens only upon an explicit FNPRejectOverload - the freenet.node.PeerManager._closerPeer() method does not take the peer's roundtrip average into account at all. Here is the [rather simple] improvement I propose: instead of making the routing decision based solely on location distance while skipping 'unsuitable' locations, search for a peer by choosing the best (numerically highest) request advance speed, computed specifically for each node:

- a disconnected peer is skipped (as it is now);
- the peers we already tried to search the key at are skipped (as it is now);
- the peer the original request came from is skipped (as it is now);
- once passive requests are implemented, other requestors of the key are to be skipped too (if they gain the key, the subscription will time out soon and then we will be able to try there);
- [the maxDiff and the old version logic is not too clear to me].

Now the real salt:

- the speed is calculated as signedLocationDiff(our_location, peer_location, key_location)/RTT(peer), where signedLocationDiff is positive if locationDiff(our_location, key_location) > locationDiff(peer_location, key_location), and negative otherwise (our_location == peer_location would give a zero difference, but that should never happen). That can be multiplied by nodeAveragePingTime if values in a predictable range (say -3..3) are preferred (but there is no need to provide guarantees);
- a backed off node's speed is reduced by 3 (or thereabouts; maybe a smaller value, like 1 or even 0.5).
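To make the metric concrete, here is a minimal Java sketch of the speed calculation described above (the class and method names are mine, not from the freenet.node code; the RTT, average ping and backoff status are assumed to be supplied by the caller):

    // Sketch of the proposed routing metric: positive when the peer is closer
    // to the key than we are, negative otherwise, divided by how long the peer
    // takes to answer. Locations are points on the [0,1) circle.
    final class RouteSpeed {

        // Circular distance between two locations on the [0,1) keyspace.
        static double locationDiff(double a, double b) {
            double d = Math.abs(a - b);
            return Math.min(d, 1.0 - d);
        }

        // Positive if the peer is closer to the key than we are, negative otherwise.
        static double signedLocationDiff(double ourLoc, double peerLoc, double keyLoc) {
            return locationDiff(ourLoc, keyLoc) - locationDiff(peerLoc, keyLoc);
        }

        // speed = signedLocationDiff / RTT, optionally scaled by the node's average
        // ping to keep the values in a small range, with a flat penalty for peers
        // that are currently backed off.
        static double speed(double ourLoc, double peerLoc, double keyLoc,
                            double rttMillis, double avgPingMillis, boolean backedOff) {
            double s = signedLocationDiff(ourLoc, peerLoc, keyLoc) / rttMillis;
            s *= avgPingMillis;          // optional normalization mentioned above
            if (backedOff) s -= 3.0;     // proposed penalty for backed off peers
            return s;
        }
    }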
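And, going back to the per-peer sockets and the zero-capped token bucket from the first half of this proposal, a minimal sketch of how a per-peer read loop could look (again the names are mine and error handling is omitted; this illustrates the semantics, it is not a drop-in implementation):

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetSocketAddress;

    // One thread per authenticated peer: a UDP socket connect()ed to that peer's
    // address, plus a token bucket that is only ever refilled up to zero.
    class PeerReadLoop implements Runnable {
        private final DatagramSocket socket;      // sees only this peer's packets
        private final double tokensPerMs;         // this peer's bandwidth share (bytes/ms)
        private double tokens = 0;                // never rises above zero
        private long lastRefill = System.currentTimeMillis();

        PeerReadLoop(InetSocketAddress remote, double tokensPerMs) throws Exception {
            this.socket = new DatagramSocket();
            this.socket.connect(remote);
            this.tokensPerMs = tokensPerMs;
        }

        private void refill() {
            long now = System.currentTimeMillis();
            tokens = Math.min(0.0, tokens + (now - lastRefill) * tokensPerMs);
            lastRefill = now;
        }

        public void run() {
            byte[] buf = new byte[1500];          // roughly one MTU
            while (!socket.isClosed()) {
                try {
                    refill();
                    if (tokens < 0) {             // bucket still negative: delay the read
                        Thread.sleep((long) Math.ceil(-tokens / tokensPerMs));
                        continue;                 // so the remote sees back-pressure early
                    }
                    DatagramPacket p = new DatagramPacket(buf, buf.length);
                    socket.receive(p);            // blocks only this peer's thread
                    tokens -= p.getLength();      // charge the bucket; it goes negative
                    handlePacket(p);              // liability checks, token crediting, etc.
                } catch (Exception e) {
                    break;
                }
            }
        }

        void handlePacket(DatagramPacket p) { /* accept/reject, credit tokens, ... */ }
    }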
Using the speed for sorting connections is very natural: it basically answers the question 'which peer should the request be sent to in order to receive the response ASAP - preferring to spend slightly more bandwidth on underloaded nodes (where bandwidth is idle, i.e. virtually free) rather than waiting for slow/overloaded nodes to do the job, provided that the solicited key already exists where it should'. And so it gives the following advantages:

- (major) we start routing around busy nodes well before they start generating FNPRejectOverload packets;
- (minor) there is no longer a need to scan the peer list twice, first skipping backed off nodes and then including them;
- (noticeable) requests that result in a successful data retrieval will deliver the answer to the original requestor faster (on average);
- (average) the routes will be more fluid and thus harder for an adversary to supervise or censor;
- (minor) very fast nodes, even with small own storage/cache, provide effective caching services (for popular keys) for their immediate peers which are slow but have huge, well-specialized storage.

And note that this algorithm will be useful even with the current code/protocol, without delayed socket reads; delaying the socket reads will undoubtedly have a positive impact by smoothing the choice over time (and efficiently fighting any oscillations) and by making the routing around busy nodes start occurring much earlier (starting from 'boundary' keys, then, if the node gets increasingly busy anyway, by further narrowing down the location space we try at the busy node).
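For completeness, a sketch of how a peer could be chosen by that speed (a hypothetical stand-in for the distance-only comparison in _closerPeer(), using the RouteSpeed sketch above and an assumed minimal peer interface):

    import java.util.List;

    // Pick the peer with the highest request advance speed, applying the skip
    // rules listed above (disconnected, already tried, original requestor).
    final class SpeedRouting {

        static Peer choosePeer(List<Peer> peers, double ourLoc, double keyLoc,
                               double avgPingMillis) {
            Peer best = null;
            double bestSpeed = Double.NEGATIVE_INFINITY;
            for (Peer p : peers) {
                if (!p.isConnected() || p.triedAlready() || p.isRequestSource())
                    continue;                     // same skip rules as the current code
                double s = RouteSpeed.speed(ourLoc, p.getLocation(), keyLoc,
                                            p.averageRoundTripMillis(), avgPingMillis,
                                            p.isBackedOff());
                if (s > bestSpeed) { bestSpeed = s; best = p; }
            }
            return best;                          // null if nobody is eligible
        }

        // Minimal peer interface assumed by this sketch.
        interface Peer {
            boolean isConnected();
            boolean triedAlready();
            boolean isRequestSource();
            boolean isBackedOff();
            double getLocation();
            double averageRoundTripMillis();
        }
    }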
