> FROM: finney.org
> DATE: 04/19/2000 12:56:32
> SUBJECT: RE: [Freenet-dev] Proposal for the Near Future
>   (Searching, CHKs and encryption, ..., oh my!)
>
> I want to reiterate a comment I made earlier, with regard to storing
> things into the Freenet under a "searchkey" like mp3.  This is not going
> to work, because too many documents will use that keyword, and they will
> all try to go onto that one node (even if the "documents" are just index
> or metadata entries there are too many).
I read your concerns before and can totally see where you are coming from. There will certainly be an increased load on the IPs that the smart routing decides should be the home for hashes of popular keywords. There are other things to consider, though.

I don't know exactly how the routing algorithm works, but it seems to me that its focus can be adjusted. What I mean is that the "best" IP for a particular hash need not strictly be one single IP, but rather a group of IPs. By adjusting the fuzziness of the targeting you might reduce the efficiency of the routing mechanism by a hop or two, but the load on the targeted server would drop by a lot more.

Also, remember that inserts are cached along the way, so a request will likely be filled (the HTL will run out) *long* before it reaches the focus of the insert. I suspect this will improve as Freenet gets bigger and bigger, since each node will know a smaller and smaller proportion of the whole network. Right now each node knows the vast majority of Freenet, and the target focus is reached quickly. In fact, there is a good chance that *the best* IP is already referenced in a node's store, so on an insert the first hop goes straight to it. Since a node doesn't have itself in its references, it sends the insert out to its best IP. That IP would likely try to send it back to *the best* IP, but since that one already has it, it sends it to the second best, and so on. Once the focus is reached, each node afterward will try to send the data to nodes that already have it. If this is how the routing works, then certainly a bit of fuzziness is in order _now_, simply for normal operation.

To put things in control-system terms: the system is undamped and in need of some damping. This can be provided with a random scattering of the target match, for instance by comparing the insertion hash against a sequential random selection of 80% of the digits of each referenced IP during routing.
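That "random selection of 80% of the digits" idea could look something like the minimal Python sketch below. Everything here is an assumption for illustration (hex-string keys, the `sample_frac` knob, counting matching digit positions as the closeness score); it is not how the actual routing code works, only a way to show how sampling the comparison blurs a single "best" node into a neighbourhood of near-best nodes:

```python
import random


def fuzzy_closeness(insert_hash, node_key, sample_frac=0.8, rng=None):
    """Count matching hex digits between an insert hash and a node's key,
    but only over a random sample (~80%) of digit positions, so the
    'best' node for a hash becomes a small group rather than one node."""
    rng = rng or random.Random()
    n = min(len(insert_hash), len(node_key))
    k = max(1, int(n * sample_frac))
    positions = rng.sample(range(n), k)  # random subset of digit positions
    return sum(insert_hash[i] == node_key[i] for i in positions)


def pick_next_hop(insert_hash, node_keys):
    """Route toward the referenced node whose (sampled) key best matches
    the insert hash; ties resolve arbitrarily, adding more scatter."""
    return max(node_keys, key=lambda key: fuzzy_closeness(insert_hash, key))
```

Because each hop resamples the positions it compares, two inserts with the same hash can take slightly different paths near the focus, spreading load the way a damped system spreads energy.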
This would also make sure that individual networks are not swamped (for instance, 123.56.78.* might have a really big affinity for the hash of the keyword "mp3").

> But if we do this, the primary keyword can't be a common word like
> "mp3".  It would be OK as a secondary keyword but as a primary it would
> be too common.  You could search with primary keyword = "backstreet boys"
> and secondary = "mp3", and that wouldn't have so many collisions.

Typically you would insert your data under the keywords that are the most specifically descriptive, and you would put the most descriptive keyword first in your search as well (as I remember you mentioning before). One would be somewhat foolish to insert a "backstreet boys" (*shudder*) mp3 under the keywords "mp3" or "music". Rather, you would find your data more retrievable if it (or references to it) were inserted under the keyword "backstreet boys" or the specific title of the song. I think insertion and requests under common keywords will be self-regulating, in that you won't find anything of quality using them ... not that the backstreet boys are quality, but to each their own :]

> <snip>
>
> What I proposed was that nodes would simply refuse to accept data if
> they already have too many entries with the identical primary searchkey
> as the incoming.  So an attempt to insert an entry under searchkey of
> "mp3" would simply (and perhaps silently) fail since the node would
> already have too many such entries.  Using "backstreet boys/mp3" would
> be more likely to succeed but even that might be too much crowding for
> some nodes.  Using "backstreet boys/i want it that way/mp3" would be
> much less likely to collide.

I think limiting the number of identical KHKs in the store is a good idea, and it will also blur the focus of the routing a bit. But it may lead to the orbiting, undamped insertions/requests I described above.
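The refuse-when-crowded rule quoted above could be sketched as a toy in-memory store. The cap value, the silent boolean failure, and the class shape are all my own assumptions for illustration, not anything from a real node implementation:

```python
from collections import defaultdict


class NodeStore:
    """Toy data store that silently refuses inserts once too many
    entries already share the same primary searchkey (e.g. "mp3")."""

    def __init__(self, max_per_key=100):  # hypothetical per-node cap
        self.max_per_key = max_per_key
        self.entries = defaultdict(list)  # primary searchkey -> documents

    def insert(self, primary_key, document):
        bucket = self.entries[primary_key]
        if len(bucket) >= self.max_per_key:
            return False  # silently refuse: this searchkey is too crowded
        bucket.append(document)
        return True
```

Under this rule an insert keyed "mp3" fills the bucket quickly and starts failing, while a more specific key like "backstreet boys/i want it that way/mp3" almost always lands.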
> There would therefore be a disadvantage to using primary searchkeys
> that are common.  They would be unlikely to be retained on insert,
> and therefore unlikely to return all the possible hits on retrieve.
> Using more combinations of keywords as primary searchkeys would make
> the data more likely to be available on Freenet, but at the cost of
> requiring users to specify several keywords in order to find the data.

When you use a web search engine to find information on a particular book, do you start your search at "paper", "ink", "book"? I don't think freenetizens will either, nor will people index their data under such vague keywords, unless they want their data to die. I think more specific keywords alone are a better bet than specific keywords strapped to general ones.

There does need to be a way to search for boolean combinations of keywords within one search attempt, however, and this can be achieved through the method you gave earlier, where the rest of the more general keywords are included in the document metadata. That way you can search for "backstreet boys" AND metadata containing "content-type=book": you wouldn't get a flood of mp3s, but rather a book about the backstreet boys (*shudder*).

Mike

_______________________________________________
Freenet-dev mailing list
Freenet-dev at lists.sourceforge.net
http://lists.sourceforge.net/mailman/listinfo/freenet-dev
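The keyword-plus-metadata AND search sketched in the reply could be modelled like this. The index layout, the "metadata" field, and the "content-type" key are assumptions made up for illustration; the point is only that the specific keyword does the routing and the metadata filter does the AND:

```python
def search(index, primary_key, metadata_filters=None):
    """Fetch the documents stored under one specific primary keyword,
    then filter them client-side on metadata fields (AND semantics)."""
    wanted = metadata_filters or {}
    hits = index.get(primary_key, [])
    return [doc for doc in hits
            if all(doc.get("metadata", {}).get(k) == v
                   for k, v in wanted.items())]
```

Here "backstreet boys" is the retrievable, specific key, while "content-type=book" narrows the result set without ever being used as a (crowded) searchkey itself.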
