This guy seems to have some interesting ideas about improving Freenet
performance:

----- Forwarded message from Ray Heasman <nurf at wuzzle.org> -----

From: Ray Heasman <[email protected]>
To: ian at octayne.com
Cc: nurf at wuzzle.org
Subject: Some useful (I hope) musings on freenet

Hi Ian,

I was getting frustrated trying to use FreeNet, and spent some time trying
to understand how it works. I also tried to come up with some useful
suggestions to make things better.

Would you mind looking through it and pointing out any glaring errors in my
understanding? I would like to send this to the mailing lists, but don't
want stupid mistakes to get in the way of a real evaluation of the ideas
presented. Obviously, I would have to change any conclusions that depended
on incorrect understanding too.

Thanks,
Ray Heasman

-------------------------

Some musings on the architecture of FreeNet

Contents

1. Introduction
2. Observations
   2.1 Reliability
   2.2 Model
3. Insertion of data
4. Retrieval of data
5. Search keys
   5.1 Reservations
6. Potential problems
   6.1 Further exacerbation
7. Suggestions
   7.1 Disable caching
   7.2 Limit caching
       7.2.1 Probabilistic caching
       7.2.2 Absolute limiting
8. Repercussions of the splitting of the model
   8.1 Possible solutions
9. Conclusion

References




1. Introduction

These ramblings are the result of my attempt to understand how the FreeNet
protocol works. It is also an attempt to analyse the current architecture so
as to provide some points of discussion. I have attempted to present my
thoughts in as rational a manner as possible. However, I am sure I have some
things wrong. Please correct me so that I can improve this document along
with my understanding.

2. Observations

2.1 Reliability

Currently, FreeNet is not very reliable, and I think that is because it is
in some way fundamentally broken. However, I don't think that it is very far
away from a useful working solution either; i.e., I think that small changes
could result in massive improvements in reliability. (And I don't think that
"permanence" is a requirement to make it work better, either.)

Personal anecdotal experience suggests that attempting to request any file
that is not wildly popular will fail.

2.2 Model

FreeNet can be viewed as two completely different networks that have been
implemented in the same protocol.

a) It is a distributed network that allows storage and retrieval of data
through a self-organising routing mechanism.

b) It is a distributed network where each node caches all data it sees.

I think behaviour inherent in (b) can clash seriously with the requirements
for (a). Furthermore, it perturbs the behaviour of the network to the extent
that it becomes difficult to examine the effects of architectural decisions
made to implement (a).

I will revisit this line of thought later.

3. Insertion of data

My understanding is that FreeNet performs a depth-first backtracking graph
traversal of nodes on the network. Insertion has two stages: first, a search
to check that the data to be inserted does not already exist; second, an
insertion phase that tunnels the data to the last point at which the search
failed.

During the request phase, each node forwards a request to the nodes it
thinks most likely to hold the data being inserted. This is done by
estimating how close the key describing the search is to the keys usually
handled by the nodes it knows about. Unique IDs are used so that requests
are never processed twice by the same node (i.e. they are used by the search
algorithm to turn the graph of nodes into a dynamically generated tree that
contains no loops). This eliminates redundant searching.

Search packets also contain a "hops to live" parameter (HTL), which is
decremented every time the search packet is forwarded to a node. 

Assuming the data to be inserted is not already on the network, the HTL
eventually drops to 0, and the lucky node thus reached signals that the
search has failed. At this point, the originating node starts streaming the
data along the searched path (sans backtracks) to the lucky node.

What I find interesting about this is that we are storing data in the last
place the network looked (rather than the first). Admittedly, this is a
directed search so after a few hops all places searched are likely
candidates for the target data.
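
To make my reading of the search concrete, here is a rough sketch in
Python. It is purely illustrative: the attribute names (store, neighbours,
seen_ids) and the estimated_distance() helper are my own inventions, not
anything in the actual node implementation, and the distributed walk is
modelled as simple recursion.

  # Purely illustrative sketch; not the real node code. The distributed
  # depth-first walk is modelled here as local recursion.
  def route_request(node, search_key, htl, request_id):
      """Depth-first, backtracking search for search_key within htl hops."""
      if request_id in node.seen_ids:      # unique IDs: a node never processes
          return None                      # the same request twice (no loops)
      node.seen_ids.add(request_id)

      if search_key in node.store:         # found: data streams back along
          return node.store[search_key]    # the path that got us here
      if htl <= 0:                         # hops-to-live exhausted: the
          return None                      # search fails at this node

      # Forward to neighbours in order of estimated key closeness,
      # backtracking to the next candidate when a branch comes up empty.
      for neighbour in sorted(node.neighbours,
                              key=lambda n: node.estimated_distance(n, search_key)):
          data = route_request(neighbour, search_key, htl - 1, request_id)
          if data is not None:
              return data
      return None

For an insert, the same walk would run to exhaustion, and the data would
then be streamed along the searched path to the node where the HTL reached
zero.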

4. Retrieval of data

Retrieval of data is simply the search phase described in section 3, with
the data being returned along the search path if found.

5. Search keys

Keys are 160-bit hashes of the data to be stored. This means that data with
similar keys may not be associated in terms of content at all. This is a
good thing, as data are not clustered on servers by subject, making the loss
of a node less damaging. However, for searching to work, there has to be a
concept of "closeness" of keys. Adam Langley provides a rough description of
how this is currently done in [AL1]. This combines the test of whether one
key is "less" than another with a binary tree to find the closest key in a
group. Closeness is defined as the distance between leaves on this tree.

5.1 Reservations

One thing that worries me about this definition of closeness is that it is
entirely dependent on what keys a node currently holds. An outside observer
with knowledge of only a subset of keys on the node would not be able to
predict the closeness of two keys calculated by that node. It also means
that as each node prunes and updates its node routing list it will change
the behaviour of future routing decisions in unpredictable ways.

This may or may not affect routing, but it certainly does affect our ability
to predict how the distributed network will behave.

6. Potential problems

The model mentioned in section 2 suggests a potential problem. In order to
be a good self-organising network that routes requests to nodes containing
their associated data, it is necessary for nodes in the vicinity of the
insertion point to have many requests routed through them. In the current
implementation, caching is very aggressive: a node caches data the first
time it sees it, and thereafter answers requests for that data from its own
cache rather than forwarding them.

This has the profound effect of moving the data away from the point of
insertion, i.e. it starves the routing mechanism of data. Cached files are so
preferred by the network that after one or two hits, the original data is
never fetched again. After a while the network forgets how to route to that
file, simply because it doesn't have to. Later, because of the incredibly
high level of thrashing in the caches, all the caches drop the file, and the
file is essentially lost.

In short, caching may be the cancer killing freenet.

6.1 Further exacerbation

Furthermore, the cache shares space with inserted files - a thrashing cache
guarantees the loss of all data legitimately inserted into a node.

Arguably, in the worst case, the capacity of the current freenet is the
total number of bytes of storage available divided by the number of nodes.
If every node requests every file with a non-uniform probability
(popularity), or is made to request every file by another node, each node
will end up with a random sample of the most popular files with an
approximate total size of the average node capacity. This strikes me as
being rather inefficient.
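
A back-of-envelope illustration of that worst case, with numbers made up
purely for the example:

  # Made-up numbers, purely to illustrate the worst case described above.
  nodes = 1000
  bytes_per_node = 10 * 1024 ** 2              # say each node donates 10 MB

  ideal_capacity = nodes * bytes_per_node      # ~10 GB if nothing were duplicated
  worst_case = bytes_per_node                  # every node holds roughly the same
                                               # popular sample, so only ~10 MB of
                                               # distinct data survives network-wide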

The best case is a freenet where no attempts are made to fetch data, which
is hardly ideal.

7. Suggestions

There are unfortunate consequences to fiddling with the current
implementation of freenet - people are starting to use it, and thus it is
difficult to make radical changes to its behaviour in the interests of
experimentation.

7.1 Disable caching

A draconian but extremely useful experiment would be to simply disable all
caching in future versions of the freenet client. This would let
us determine the following:

a) Does the routing algorithm work? Simulations have been done, but do they
match reality well enough?

b) How well does it work?

c) How can we make it better? We could test some questions, like:
        i)   Does it help to make the search algorithm stochastic?
        ii)  Does it help for data to be inserted at more than one point?
        iii) Should we use a more absolute key comparison method, such
             as Hamming distances or log(|knownkey-searchkey|)? (A sketch
             of both metrics follows below.)
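
As an illustration of option (iii), both metrics are a few lines of Python
if keys are treated as 160-bit integers (sketch only; the function names are
mine):

  import math

  def hamming_distance(key_a, key_b):
      """Number of differing bits between two 160-bit keys."""
      return bin(key_a ^ key_b).count("1")

  def log_distance(known_key, search_key):
      """Log of the absolute numeric difference between two keys."""
      diff = abs(known_key - search_key)
      return math.log(diff) if diff else 0.0   # identical keys: distance 0

  # Unlike the tree-based closeness, either metric gives the same answer
  # regardless of which other keys a node happens to hold.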


7.2 Limit caching

Another option is to severely curtail the caching without completely
disabling it. 

7.2.1 Probabilistic caching

A node could have two parameters. The first parameter (call it C) would
specify the probability of the node caching a file. The second parameter
(call it H) is the probability that a request for the file would produce a
"hit" on the cached data.

C would control how aggressively the network caches files, and H would
control what percentage of requests would make it back to the original
insertion point.

C is perhaps not a good idea - nodes will spottily cache a file all over the
network, perhaps diluting routing information. However, a bottleneck in the
network would be guaranteed to receive more requests for a popular file and
would be more likely to cache that file.

H is probably always a good idea - it ensures that the routing mechanism is
constantly exercised, and in direct proportion to the popularity of a file.
This would go well with a mildly stochastic search algorithm that has a
small probability of picking a different order for search targets at a node.
Such an algorithm would smooth the search space, providing more even
gradients to the nearest "gully".
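
A minimal sketch of how C and H might look inside a node (Python,
illustrative only; forward_upstream() is a stand-in for whatever the normal
routing path is, not a real function):

  import random

  CACHE_PROBABILITY = 0.25   # C: chance a passing file is cached at all
  HIT_PROBABILITY = 0.80     # H: chance a cached copy is allowed to answer

  def forward_upstream(node, key):
      """Placeholder for the normal request forwarding / routing path."""
      return None

  def maybe_cache(node, key, data):
      """Cache a passing file only with probability C."""
      if random.random() < CACHE_PROBABILITY:
          node.cache[key] = data

  def handle_request(node, key):
      """Answer from the cache with probability H; otherwise forward the
      request upstream so the routing mechanism keeps being exercised."""
      if key in node.cache and random.random() < HIT_PROBABILITY:
          return node.cache[key]
      return forward_upstream(node, key)

The flag on H-forced requests suggested in the next paragraph would fit
naturally as an extra argument to forward_upstream().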

A possible improvement would be to specify that requests forced by the H
parameter be flagged so that no data would be returned once they hit a node
upstream. This would limit caching to areas near "forks" in "tributaries" in
the search graph, along with limiting wasted bandwidth. Along with some form
of limits on caching, this would produce a "blockade" effect: Nodes that
tend to concentrate requests for a particular file would be more likely to
cache that file. By doing so, they would ensure that any nodes upstream of
them ignore requests through the cache. This would mean that you would tend
to have a "blockade" of caches between all destinations and the source, and
_only_ at concentration points and _only_ a limited number of times between
any destination and the source (1 for popular files, a small number for
wildly popular files).

7.2.2 Absolute limiting

A node could either limit the amount of space dedicated to cached data, or
limit the number of files cached. This implies that cached data and "real"
data should not share the same repository. This would at least stop real
data from being sacrificed on the altar of caching. It does not address the
routing dilution effect at all, though. One approach would be to use this
method along with the H parameter suggested in 7.2.1.
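
For what it is worth, the separation itself is simple to sketch (Python,
illustrative only; the class name and size limit are my own): keep cached
copies in a small bounded store that can never touch the inserted-data
repository.

  from collections import OrderedDict

  class BoundedCache:
      """LRU cache limited to max_files entries; inserted data lives in a
      separate repository that this cache can never evict from."""
      def __init__(self, max_files=100):
          self.max_files = max_files
          self.entries = OrderedDict()

      def put(self, key, data):
          self.entries[key] = data
          self.entries.move_to_end(key)
          if len(self.entries) > self.max_files:
              self.entries.popitem(last=False)   # drop least recently used copy

      def get(self, key):
          if key in self.entries:
              self.entries.move_to_end(key)      # refresh recency on a hit
              return self.entries[key]
          return None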

It may be that if the caches are limited enough, they will thrash enough to
ensure that the network constantly relearns how to route to its original
files.

I think the split between cached data and inserted data that absolute
limiting forces is far more important than the particular method used to
split them.

8. Repercussions of the splitting of the model

There are consequences of treating cached data as a copy of some original
data somewhere else on the network. Simply, once the original data at the
insertion point is lost, the caches are sure to follow eventually. This
could make freenet less resistant to attack.

8.1 Possible solutions

One solution is to modify inserts to insert several copies of a file. This
could be messy and is at best a postponement of the problem, rather than a
fix. One advantage of the current freenet architecture is that once a file
has become popular, it is difficult to make copies "go away". (However, I
will take no bets on the health of that same file 15 minutes after its
moment of notoriety.)

A better solution would be to "canonise" files in the cache by moving a
certain class of files from the cache to the inserted-data repository. This
would be done with a certain probability if the cache notices that requests
sent upstream (i.e. towards the source of the data) suddenly start failing.

Such a set of criteria could be:

IF (data is going to be dumped from cache due to space constraints) AND
   (data has been requested > k times) AND
   (number of consecutive upstream H-tagged failures > j%)
THEN canonise data.
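
In code form (Python sketch; the counters, default values of k and j, and
the reading of the third criterion as a failure percentage over recent
H-tagged requests are my own interpretation):

  def should_canonise(entry, k=10, j=50):
      """Decide whether a cache entry earns promotion to the real store."""
      return (entry.about_to_be_evicted                # space pressure
              and entry.request_count > k              # popular enough
              and entry.upstream_failure_percent > j)  # source looks gone

  def evict_or_canonise(node, entry):
      if should_canonise(entry):
          node.store[entry.key] = entry.data   # "canonise" into the inserted-data store
      del node.cache[entry.key]                # either way, the cached copy goes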

Canonisation should be kept to a minimum to minimise the number of changes
the self-organising routing will have to accommodate.

Such canonised files would then be lost in the usual way - i.e. through
replacement due to space constraints in the inserted-data repository. This
would
mean that an attacker would have to flood the entire freenet (due to content
hashing) to ensure the loss of a file.

9. Conclusion

I have presented how I think freenet works, and a hypothesis as to why it
does not work better. I have laid out my arguments for the hypothesis, along
with a list of possible solutions and my estimates of how they would affect
freenet.

A freenet that explicitly distinguishes between cached data and inserted
data would have to work slightly differently to the current freenet.
However, I think that doing so will allow better routing accuracy, along
with more efficient use of space. The current freenet thrashes its caches at
the expense of its ability to route, whilst destroying its internal data
store by replacing real data with copies of the flavour of the moment.


References

[AL1] : "The Freenet Protocol", 
http://freenetproject.org/index.php?page=protocol








----- End forwarded message -----