This guy seems to have some interesting ideas about improving Freenet performance:
----- Forwarded message from Ray Heasman <nurf at wuzzle.org> -----

From: Ray Heasman <[email protected]>
To: ian at octayne.com
Cc: nurf at wuzzle.org
Subject: Some useful (I hope) musings on freenet

Hi Ian,

I was getting frustrated trying to use Freenet, and spent some time trying to understand how it works. I also tried to come up with some useful suggestions to make things better. Would you mind looking through this and pointing out any glaring errors in my understanding? I would like to send it to the mailing lists, but don't want stupid mistakes to get in the way of a real evaluation of the ideas presented. Obviously, I would also have to change any conclusions that depend on incorrect understanding.

Thanks,
Ray Heasman

-------------------------

Some musings on the architecture of Freenet

Contents

1. Introduction
2. Observations
2.1 Reliability
2.2 Model
3. Insertion of data
4. Retrieval of data
5. Search keys
5.1 Reservations
6. Potential problems
6.1 Further exacerbation
7. Suggestions
7.1 Disable caching
7.2 Limit caching
7.2.1 Probabilistic caching
7.2.2 Absolute limiting
8. Repercussions of splitting the model
8.1 Possible solutions
9. Conclusion
References

1. Introduction

These ramblings are the result of my attempt to understand how the Freenet protocol works. They are also an attempt to analyse the current architecture so as to provide some points of discussion. I have tried to lay out my thoughts in as rational a manner as possible. However, I am sure I have some things wrong. Please correct me so that I can improve this document along with my understanding.

2. Observations

2.1 Reliability

Currently, Freenet is not very reliable, and I think that is because it is in some way fundamentally broken. However, I don't think it is very far from a useful working solution either; that is, I think small changes could result in massive improvements in reliability. (And I don't think that "permanence" is a requirement to make it work better, either.) Personal anecdotal experience suggests that attempting to request any file that is not wildly popular will fail.

2.2 Model

Freenet can be viewed as two completely different networks implemented in the same protocol:

a) It is a distributed network that allows storage and retrieval of data through a self-organising routing mechanism.
b) It is a distributed network where each node caches all data it sees.

I think behaviour inherent in (b) can clash seriously with the requirements for (a). Furthermore, it perturbs the behaviour of the network to the extent that it becomes difficult to examine the effects of architectural decisions made to implement (a). I will revisit this line of thought later.

3. Insertion of data

My understanding is that Freenet performs a depth-first backtracking graph traversal of the nodes on the network. Insertion has two stages: first, a search to check that the data to be inserted does not already exist; second, an insertion phase that tunnels the data to the last point at which the search failed.

During the request phase, each node forwards a request to the nodes it thinks most likely to have the data being inserted. This is done by estimating how close the key describing the search is to the keys usually handled by the nodes it knows about. Unique IDs are used so that requests are never processed twice by the same node (i.e. they are used by the search algorithm to turn the graph of nodes into a dynamically generated tree containing no loops). This eliminates redundant searching.

Search packets also contain a "hops to live" parameter (HTL), which is decremented every time the search packet is forwarded to a node. Assuming the data to be inserted is not already on the network, HTL eventually drops to 0, and the lucky node thus found signals that the search failed. At this point, the originating node starts streaming the data back along the path searched (sans backtracks) to the lucky node. What I find interesting about this is that we are storing data in the last place the network looked (rather than the first). Admittedly, this is a directed search, so after a few hops all the places searched are likely candidates for the target data.
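To pin down my reading of the request phase, here is a rough Python sketch. The Node class, the closeness estimate and the return convention are all my own inventions for illustration; only the depth-first, HTL-limited, loop-free traversal is meant to match the description above.

    class Node:
        def __init__(self, name, neighbours=None, store=None):
            self.name = name
            self.neighbours = neighbours or []   # other Node objects this node knows about
            self.store = store or {}             # key (int) -> data held locally
            self.seen_ids = set()                # unique request IDs already processed

        def estimate_distance(self, key):
            # Stand-in for the node's notion of key closeness (see section 5):
            # distance from the search key to the nearest key this node holds.
            return min((abs(key - k) for k in self.store), default=2**160)

        def handle_request(self, key, request_id, htl):
            """Return (found, path); path is the chain of nodes actually searched."""
            if request_id in self.seen_ids:
                return False, []                 # seen this request before: caller backtracks
            self.seen_ids.add(request_id)
            if key in self.store:
                return True, [self]              # found: data flows back along the path
            if htl <= 0:
                return False, [self]             # the "lucky node": the search stops here
            # Depth-first: try neighbours in order of estimated closeness to the key,
            # backtracking to the next candidate whenever a branch is a dead end.
            for nbr in sorted(self.neighbours, key=lambda n: n.estimate_distance(key)):
                found, path = nbr.handle_request(key, request_id, htl - 1)
                if found or path:                # data found, or HTL ran out downstream
                    return found, [self] + path
            return False, []                     # every branch already seen: backtrack

On an insert, the originating node would run this with a fresh request ID; if the result is (False, path) with a non-empty path, the data would then be streamed to path[-1], the last point at which the search failed.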
4. Retrieval of data

Retrieval of data is simply the search phase described in section 3, with the data being returned along the search path if it is found.

5. Search keys

Keys are 160-bit hashes of the data to be stored. This means that data with similar keys may not be associated in terms of content at all. This is a good thing, as data are not clustered on servers by subject, making the loss of a node less damaging. However, for searching to work there has to be a concept of "closeness" of keys. Adam Langley provides a rough description of how this is currently done in [AL1]. It combines a test of whether one key is "less" than another with a binary tree used to find the closest key in a group. Closeness is defined as the distance between leaves on this tree.

5.1 Reservations

One thing that worries me about this definition of closeness is that it is entirely dependent on what keys a node currently holds. An outside observer with knowledge of only a subset of the keys on the node would not be able to predict the closeness of two keys as calculated by that node. It also means that as each node prunes and updates its routing list, it will change the behaviour of future routing decisions in unpredictable ways. This may or may not affect routing, but it certainly affects our ability to predict how the distributed network will behave.
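To make the node-relative nature of the current definition concrete, here is a toy comparison in Python (my own construction, not the algorithm from [AL1]) between an absolute, observer-independent metric over the key space and the "closest key among those this node currently holds" style of comparison. SHA-1 is used purely as a stand-in for the 160-bit content hash.

    import hashlib

    def key_of(data):
        # 160-bit content key; SHA-1 is used here purely as a convenient
        # example of a 160-bit hash, not as a claim about the real key format.
        return int.from_bytes(hashlib.sha1(data).digest(), "big")

    def absolute_distance(k1, k2):
        # Observer-independent: every node computes the same answer.
        return abs(k1 - k2)

    def closest_held_key(search_key, held_keys):
        # Node-relative, like the current scheme: the answer (and hence the
        # routing decision) changes whenever this node's key list changes.
        return min(held_keys, key=lambda k: absolute_distance(k, search_key))

closest_held_key is only as stable as the key list it is given, which is exactly the unpredictability that worries me; absolute_distance is the sort of observer-independent comparison suggested in section 7.1 below.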
6. Potential problems

The model mentioned in section 2 suggests a potential problem. To be a good self-organising network that routes requests to the nodes holding their associated data, the nodes in the vicinity of an insertion point need to have many requests routed through them. In the current implementation, caching is very aggressive: a node caches data the first time it sees it, and never requests that data from the network again when a request for it arrives. This has the profound effect of moving the data away from the point of insertion, i.e. it starves the routing mechanism of data. Cached files are so preferred by the network that after one or two hits the original data is never fetched again. After a while the network forgets how to route to that file, simply because it doesn't have to. Later, because of the incredibly high level of thrashing in the caches, all the caches drop the file, and the file is essentially lost. In short, caching may be the cancer killing Freenet.

6.1 Further exacerbation

Furthermore, the cache shares space with inserted files, so a thrashing cache guarantees the loss of all data legitimately inserted into a node. Arguably, in the worst case, the capacity of the current Freenet is the total number of bytes of storage available divided by the number of nodes. If every node requests every file with a non-uniform probability (popularity), or is made to request every file by another node, each node will end up with a random sample of the most popular files, with an approximate total size of the average node capacity. (For example, 1000 nodes each contributing 1 GB would, in this degenerate case, hold only about 1 GB of distinct data between them.) This strikes me as rather inefficient. The best case is a Freenet where no attempts are made to fetch data, which is hardly ideal.

7. Suggestions

There are unfortunate consequences to fiddling with the current implementation of Freenet: people are starting to use it, which makes it difficult to make radical changes to its behaviour in the interests of experimentation.

7.1 Disable caching

A draconian experiment that would be extremely useful would be to simply disable all caching in future versions of the Freenet client. This would let us determine the following:

a) Does the routing algorithm work? Simulations have been done, but do they match reality well enough?
b) How well does it work?
c) How can we make it better? We could test some questions, like:
   i) Does it help to make the search algorithm stochastic?
   ii) Does it help for data to be inserted at more than one point?
   iii) Should we use a more absolute key comparison method, such as Hamming distance or log(|knownkey - searchkey|)?

7.2 Limit caching

Another option is to severely curtail the caching without completely disabling it.

7.2.1 Probabilistic caching

A node could have two parameters. The first (call it C) would specify the probability of the node caching a file. The second (call it H) would be the probability that a request for the file produces a "hit" on the cached data. C would control how aggressively the network caches files, and H would control what percentage of requests make it back to the original insertion point.

C is perhaps not a good idea: nodes would spottily cache a file all over the network, perhaps diluting routing information. However, a bottleneck in the network would be guaranteed to receive more requests for a popular file, and would therefore be more likely to cache that file.

H is probably always a good idea: it ensures that the routing mechanism is constantly exercised, in direct proportion to the popularity of a file. This would go well with a mildly stochastic search algorithm that has a small probability of picking a different order of search targets at a node. Such an algorithm would smooth the search space, providing more even gradients to the nearest "gully".

A possible improvement would be to specify that requests forced through by the H parameter are flagged so that no data is returned once they hit a node upstream. This would limit caching to areas near "forks" in "tributaries" of the search graph, as well as limiting wasted bandwidth. Along with some form of limit on caching, this would produce a "blockade" effect: nodes that tend to concentrate requests for a particular file would be more likely to cache that file. By doing so, they would ensure that nodes upstream of them ignore requests that have already passed through a cache. You would therefore tend to get a "blockade" of caches between all destinations and the source, and _only_ at concentration points, and _only_ a limited number of times between any destination and the source (1 for popular files, a small number for wildly popular files).
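To make C and H concrete, here is a minimal Python sketch of how they might be wired into a node's request handling. The class and method names are mine and purely illustrative; only the two probabilities correspond to anything described above.

    import random

    class CachingPolicy:
        # C: probability of caching data the first time it passes through this node.
        # H: probability that a cached copy is allowed to answer a request; with
        #    probability (1 - H) the request is forwarded upstream anyway, so the
        #    route back to the insertion point keeps being exercised.
        def __init__(self, c=0.2, h=0.8):
            self.c = c
            self.h = h
            self.cache = {}                      # key -> data (bounded in practice; see 7.2.2)

        def on_data_passing_through(self, key, data):
            # Called as data streams through this node on its way to the requester.
            if key not in self.cache and random.random() < self.c:
                self.cache[key] = data           # cache with probability C

        def try_answer(self, key):
            # Return cached data, or None to force the request further upstream.
            if key in self.cache and random.random() < self.h:
                return self.cache[key]           # answer from cache with probability H
            return None                          # miss, or a deliberate (1 - H) pass-through

Requests that fall into the (1 - H) branch are the ones that could carry the "no data needed" flag suggested above, so that upstream nodes keep their routing fresh without the payload being sent back down again.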
7.2.2 Absolute limiting

A node could either limit the amount of space dedicated to cached data, or limit the number of files cached. This implies that cached data and "real" data should not share the same repository. That would at least stop real data from being sacrificed on the altar of caching. It does not address the routing-dilution effect at all, though. One approach would be to use this method along with the H parameter suggested in 7.2.1. It may be that if the caches are limited enough, they will thrash enough to ensure that the network constantly relearns how to route to its original files. I think the split between cached data and inserted data that absolute limiting forces is far more important than the particular method used to impose the limit.

8. Repercussions of splitting the model

There are consequences to treating cached data as a copy of some original data held somewhere else on the network. Simply put, once the original data at the insertion point is lost, the caches are sure to follow eventually. This could make Freenet less resistant to attack.

8.1 Possible solutions

One solution is to modify inserts to insert several copies of a file. This could be messy and is at best a postponement of the problem rather than a fix. One advantage of the current Freenet architecture is that once a file has become popular, it is difficult to make copies "go away". (However, I will take no bets on the health of that same file 15 minutes after its moment of notoriety.)

A better solution would be to "canonise" files in the cache by moving a certain class of files from the cache to the inserted-data repository. This would be done with a certain probability if the cache notices that requests sent upstream (i.e. towards the source of the data) suddenly start failing. Such a set of criteria could be:

    IF (data is going to be dumped from the cache due to space constraints)
    AND (data has been requested > k times)
    AND (number of consecutive upstream H-tagged failures > j%)
    THEN canonise the data.

Canonisation should be kept to a minimum, to minimise the number of changes the self-organising routing has to accommodate. Canonised files would then be lost in the usual way, i.e. through replacement due to space constraints in the inserted-data repository. This would mean that an attacker would have to flood the entire Freenet (due to content hashing) to ensure the loss of a file.
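As a sketch of how such a rule might look in code: the field names and default thresholds are my own, and I have read the "j%" criterion as a failure rate over recent H-tagged upstream probes, which may not be the only reasonable interpretation.

    from dataclasses import dataclass

    @dataclass
    class CacheEntry:
        pending_eviction: bool           # marked for removal due to space constraints
        request_count: int               # requests this node has seen for the data
        upstream_h_failure_pct: float    # % of recent H-tagged upstream probes that failed

    def should_canonise(entry, k=10, j=50.0):
        # Promote the entry from the cache to the inserted-data repository
        # instead of dropping it, if all three criteria hold. k and j are the
        # thresholds from the rule above; the defaults here are arbitrary.
        return (entry.pending_eviction
                and entry.request_count > k
                and entry.upstream_h_failure_pct > j)

The "certain probability" mentioned above could be added as a final random draw, so that canonisation stays rare even when all three criteria hold.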
9. Conclusion

I have presented how I think Freenet works, and a hypothesis as to why it does not work better. I have laid out my arguments for the hypothesis, and a list of possible solutions along with my estimates of how they would affect Freenet. A Freenet that explicitly distinguishes between cached data and inserted data would have to work slightly differently from the current Freenet. However, I think that doing so would allow better routing accuracy, along with more efficient use of space. The current Freenet thrashes its caches at the expense of its ability to route, whilst destroying its internal data store by replacing real data with copies of the flavour of the moment.

References

[AL1]: "The Freenet Protocol", http://freenetproject.org/index.php?page=protocol

----- End forwarded message -----