On Monday 19 July 2010 21:55:26 Matthew Toseland wrote:
> FASD paper:
> http://freenetproject.org/papers/kronfol_final_thesis.pdf
>
> The scheme in the FASD paper enables keyword searching but is spammable.
> Basically it is really easy (I haven't read the paper in a while so my
> terminology may be completely different to his):
> - Keywords are hashed, of course, and these hashes translate to routing
> locations.
> - An index element has a list of keywords connected to it.
> - These are inserted as normal keys, except that they are inserted to the
> locations of *each* of the keywords.
> - A search can be routed to all of the keyword locations or to some subset
> of them. It will include a list of keywords.
> - The receiving nodes will send back any index elements that match the
> search.
> - Standard algorithms for relevance are used to rank results (based on
> relative frequency etc). This can easily be extended for arbitrary
> insert-time relevance numbers.
> - I don't recall whether flow control was specified; it could become a
> problem, but the obvious solution is just to limit the number of returned
> items.
>
> The fundamental problem has always been spam: these are effectively KSKs,
> and when you fetch them you propagate all the spam that the spammer has
> inserted.
>
> So solve it the same way we solve this for chat. All we have to do to make
> FASD feasible is:
> - Make the items signed.
> - Send a compressed list of the inserter identities we are interested in
> along with the request.
>
> Clearly scalability is limited in much the same way as it is for chat.
> However, it should be workable provided there is a certain amount of
> aggregation and centralisation at the end-user level. And it's a big
> improvement on having to download several huge indexes and intersect them
> on the client. Arguably it may be more efficient to use this mechanism
> only for popular keywords; for unpopular keywords the btree indexes should
> be very efficient. And it's probably a useful primitive for many
> applications...
>
> Could it be made fully distributed, without needing a WoT? Yes, but only
> with:
> 1) Some sort of reputation feedback mechanism. This most likely involves
> not propagating, or de-propagating, content that was fetched and not
> wanted, and
> 2) A sufficiently high cost for inserts of keywords to deter spamming.
> E.g. something similar to the Scarce KSKs proposal - limit the number of
> keyword index items that can be inserted to one per X seconds per link.
>
> These are in principle feasible but may open serious attacks: the first
> may introduce censorship attacks, and links to the wider question of
> whether it is ever safe to allow discovering the location of a document
> without propagating it. IMHO the main issue there is path folding, and we
> can turn path folding off for keyword searches (just as we turn it off for
> SSKs and inserts). Plus, will users give sufficiently useful feedback?
>
> KISS: Simplest mechanism: keep an LRU (random replacement is better for
> data storage, but not here). When a fetch happens, generate an unforgeable
> token for each returned datum. If the user gives positive feedback, this
> is sent back to us, we "spend" the token (i.e. commit it to persistent
> storage so it cannot be spent twice, like digicash), and promote the item.
> This would be an operation similar to an insert, which would promote the
> item on those nodes where it matched the original fetch. Obviously it will
> only work for a relatively short period after the original request.
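(Interjecting here to pin down the "spend the token" step: a minimal sketch
in Java, purely illustrative and not anything from fred. The class name, the
HMAC construction, and the in-memory set standing in for the persistent
spent-token store are all my own assumptions. The point is just that a token
is unforgeable without the node's secret, and committing it on first spend
makes double-spending detectable, as with digicash.)

import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of spend-once feedback tokens. A token is an HMAC
// over (item hash, random nonce) under a per-node secret, so it cannot be
// forged without the secret; spending it records it so the same token
// cannot promote an item twice.
public class FeedbackTokens {
    private final Mac hmac;
    private final SecureRandom random = new SecureRandom();
    // Stands in for persistent storage of spent tokens.
    private final Set<String> spent = new HashSet<>();

    public FeedbackTokens(byte[] nodeSecret) throws Exception {
        hmac = Mac.getInstance("HmacSHA256");
        hmac.init(new SecretKeySpec(nodeSecret, "HmacSHA256"));
    }

    /** Issued alongside each returned datum at fetch time. */
    public String issue(byte[] itemHash) {
        byte[] nonce = new byte[16];
        random.nextBytes(nonce);
        return hex(nonce) + ":" + hex(mac(itemHash, nonce));
    }

    /** Called when positive feedback arrives; true if we should promote. */
    public synchronized boolean spend(byte[] itemHash, String token) {
        String[] parts = token.split(":");
        if (parts.length != 2) return false;
        byte[] nonce = unhex(parts[0]);
        // Reject forged tokens: recompute the MAC under our secret.
        if (!parts[1].equals(hex(mac(itemHash, nonce)))) return false;
        // Reject double-spends: commit the token before promoting.
        return spent.add(token);
    }

    private synchronized byte[] mac(byte[] itemHash, byte[] nonce) {
        hmac.reset();
        hmac.update(itemHash);
        return hmac.doFinal(nonce);
    }

    private static String hex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02x", x));
        return sb.toString();
    }

    private static byte[] unhex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++)
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        return out;
    }
}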
> We might have recently fetched items stored in a special cache, to be
> promoted into the main cache if feedback comes in and dropped if it
> doesn't. To prevent abuse, we could perhaps use rate limiting similar to
> that proposed for inserting? A spammer might keep stuff in the cache
> indefinitely, without anyone else fetching it, by repeatedly requesting it
> (he can't repeatedly insert) - but not if we don't promote items that we
> have already returned to requests from a node close to the originator. Or
> we just impose rate limiting on search requests as well as inserts. Or
> ignore it - spam is only a problem if it is in volume, and inserts are
> throttled anyway.
>
> Hmmm, what exactly does FASD say about this? My recollection was that FASD
> was spammable. I should really read the paper again! It looks like it
> *might* be a tractable problem ...
>
> And can these mechanisms prevent spam on chat too?? Yes, chat is
> essentially a search problem.

I would just like to point out that we are a long way away from the above:
- Scarce KSKs, or any similar mechanism, are equivalent to token passing.
That is hard.
- Even for the single-inserter/WoT case, load balancing would be a problem.
One solution is to "partition by document and by keyword", in the jargon of
RFC 4981: segment either all documents, or the results from a single
keyword, into manageable-sized groups which can be inserted to different
locations. The latter gives us sanely sized groupings, and the different
sub-keywords could be indicated in the btree. We fetch from the btree; if
the match list is small we just get the data, and if it is big we get a
list of sub-keyword hashes which we then search for with the above
mechanism (this will be faster than downloading and intersecting
multi-gigabyte index files). See the sketch below.
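A sketch of that fallback logic, again in hypothetical Java. The interface
names (BtreeEntry, BtreeIndex, NetworkSearch) are made up for illustration;
the point is only the control flow: unpopular keywords are served straight
out of the btree, and only popular keywords spill over into on-network
searches against per-segment sub-keyword locations.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "partition by keyword" fallback: consult the
// btree index first; only big result sets trigger on-network searches.
public class HybridSearch {

    /** What the btree stores for one keyword: either the matches
     *  themselves, or (for big result sets) the hashes of the
     *  sub-keyword segments. */
    public interface BtreeEntry {
        boolean isInline();                // small: matches stored directly
        List<String> inlineMatches();      // the documents, if inline
        List<byte[]> subKeywordHashes();   // segment locations, if spilled
    }

    public interface BtreeIndex {
        BtreeEntry lookup(String keyword); // assumed client-side btree fetch
    }

    public interface NetworkSearch {
        // Route a search to the location of one sub-keyword hash, carrying
        // the compressed list of trusted inserter identities (see above).
        List<String> search(byte[] subKeywordHash);
    }

    private final BtreeIndex btree;
    private final NetworkSearch network;

    public HybridSearch(BtreeIndex btree, NetworkSearch network) {
        this.btree = btree;
        this.network = network;
    }

    public List<String> query(String keyword) {
        BtreeEntry entry = btree.lookup(keyword);
        if (entry.isInline()) {
            // Unpopular keyword: the btree already holds the full list.
            return entry.inlineMatches();
        }
        // Popular keyword: fetch each manageable-sized segment on-network
        // instead of downloading and intersecting a huge index file.
        List<String> results = new ArrayList<>();
        for (byte[] hash : entry.subKeywordHashes())
            results.addAll(network.search(hash));
        return results;
    }
}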
For the time being, btree indexes and WoT are the way to go. There is much
that can be done to improve on them. On-network intersection is something to
consider much later on, e.g. post-1.0. Hence my posting it to tech at .
