On Monday 19 July 2010 21:55:26 Matthew Toseland wrote:
> FASD paper:
> http://freenetproject.org/papers/kronfol_final_thesis.pdf
>
> The scheme in the FASD paper enables keyword searching but is spammable.
> Basically it is really easy (I haven't read the paper in a while so my
> terminology may be completely different to his):
> - Keywords are hashed, of course, and these hashes translate to routing
> locations.
> - An index element has a list of keywords connected to it.
> - These are inserted as normal keys, except that they are inserted to the
> locations of *each* of the keywords.
> - A search can be routed to all of the keyword locations or to some subset
> of them. It will include a list of keywords.
> - The receiving nodes will send back any index elements that match the
> search.
> - Standard algorithms for relevance are used to rank results (based on
> relative frequency etc). This can easily be extended for arbitrary
> insert-time relevance numbers.
> - I don't recall whether flow control was specified; it could become a
> problem, but the obvious solution is just to limit the number of returned
> items.
>
> The fundamental problem has always been spam: these are effectively KSKs,
> and when you fetch them you propagate all the spam that the spammer has
> inserted.
>
> So solve it the same way we solve this for chat. All we have to do to make
> FASD feasible is:
> - Make the items signed.
> - Send a compressed list of the inserter identities we are interested in
> along with the request.
>
> Clearly scalability is limited in much the same way as it is for chat.
> However, it should be workable provided there is a certain amount of
> aggregation and centralisation at the end-user level. And it's a big
> improvement on having to download several huge indexes and intersect them
> on the client. Arguably it may be more efficient to use this mechanism
> only for popular keywords; for unpopular keywords the btree indexes should
> be very efficient. And it's probably a useful primitive for many
> applications...
>
> Could it be made fully distributed, without needing a WoT? Yes, but only
> with:
> 1) Some sort of reputation feedback mechanism. This most likely involves
> not propagating, or de-propagating, content that was fetched and not
> wanted, and
> 2) A sufficiently high cost for inserts of keywords to deter spamming.
> E.g. something similar to the Scarce KSKs proposal - limit the number of
> keyword index items that can be inserted to one per X seconds per link.
>
> These are in principle feasible but may open serious attacks: the first
> may introduce censorship attacks, and links to the wider question of
> whether it is ever safe to allow discovering the location of a document
> without propagating it. IMHO the main issue there is path folding, and we
> can turn path folding off for keyword searches (just as we turn it off for
> SSKs and inserts). Plus, will users give sufficiently useful feedback?
>
> KISS: Simplest mechanism: keep an LRU (random replacement is better for
> data storage, but not here). When a fetch happens, generate an unforgeable
> token for each returned datum. If the user gives positive feedback, this
> is sent back to us, we "spend" the token (i.e. commit it to persistent
> storage so it cannot be spent twice, like digicash), and promote the item.
> This would be an operation similar to an insert, which would promote the
> item on those nodes where it matched the original fetch. Obviously it will
> only work for a relatively short period after the original request.
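(Interjecting here to pin down the "spend the token" step: a minimal sketch
in Java, purely illustrative and not anything from fred. The class name, the
HMAC construction, and the in-memory set standing in for the persistent
spent-token store are all my own assumptions. The point is just that a token
is unforgeable without the node's secret, and committing it on first spend
makes double-spending detectable, as with digicash.)

import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of spend-once feedback tokens. A token is an HMAC
// over (item hash, random nonce) under a per-node secret, so it cannot be
// forged without the secret; spending it records it so the same token
// cannot promote an item twice.
public class FeedbackTokens {
    private final Mac hmac;
    private final SecureRandom random = new SecureRandom();
    // Stands in for persistent storage of spent tokens.
    private final Set<String> spent = new HashSet<>();

    public FeedbackTokens(byte[] nodeSecret) throws Exception {
        hmac = Mac.getInstance("HmacSHA256");
        hmac.init(new SecretKeySpec(nodeSecret, "HmacSHA256"));
    }

    /** Issued alongside each returned datum at fetch time. */
    public String issue(byte[] itemHash) {
        byte[] nonce = new byte[16];
        random.nextBytes(nonce);
        return hex(nonce) + ":" + hex(mac(itemHash, nonce));
    }

    /** Called when positive feedback arrives; true if we should promote. */
    public synchronized boolean spend(byte[] itemHash, String token) {
        String[] parts = token.split(":");
        if (parts.length != 2) return false;
        byte[] nonce = unhex(parts[0]);
        // Reject forged tokens: recompute the MAC under our secret.
        if (!parts[1].equals(hex(mac(itemHash, nonce)))) return false;
        // Reject double-spends: commit the token before promoting.
        return spent.add(token);
    }

    private synchronized byte[] mac(byte[] itemHash, byte[] nonce) {
        hmac.reset();
        hmac.update(itemHash);
        return hmac.doFinal(nonce);
    }

    private static String hex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02x", x));
        return sb.toString();
    }

    private static byte[] unhex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++)
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        return out;
    }
}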
> We might have recently fetched items stored in a special cache, to be
> promoted into the main cache if feedback comes in and dropped if it
> doesn't. To prevent abuse, we could perhaps use rate limiting similar to
> that proposed for inserting? A spammer might keep stuff in the cache
> indefinitely, without anyone else fetching it, by repeatedly requesting it
> (he can't repeatedly insert) - but not if we don't promote items that we
> have already returned to requests from a node close to the originator. Or
> we just impose rate limiting on search requests as well as inserts. Or
> ignore it - spam is only a problem if it is in volume, and inserts are
> throttled anyway.
>
> Hmmm, what exactly does FASD say about this? My recollection was that FASD
> was spammable. I should really read the paper again! It looks like it
> *might* be a tractable problem ...
>
> And can these mechanisms prevent spam on chat too?? Yes, chat is
> essentially a search problem.

I would just like to point out that we are a long way away from the above:
- Scarce KSKs, or any similar mechanism, are equivalent to token passing.
That is hard.
- Even for the single-inserter/WoT case, load balancing would be a problem.
One solution is to "partition by document and by keyword", in the jargon of
RFC 4981: segment either all documents, or the results from a single
keyword, into manageable-sized groups which can be inserted to different
locations. The latter gives us sanely sized groupings, and the different
sub-keywords could be indicated in the btree. We fetch from the btree; if
the match list is small we just get the data, and if it is big we get a
list of sub-keyword hashes which we then search for with the above
mechanism (this will be faster than downloading and intersecting
multi-gigabyte index files). See the sketch below.
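A sketch of that fallback logic, again in hypothetical Java. The interface
names (BtreeEntry, BtreeIndex, NetworkSearch) are made up for illustration;
the point is only the control flow: unpopular keywords are served straight
out of the btree, and only popular keywords spill over into on-network
searches against per-segment sub-keyword locations.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "partition by keyword" fallback: consult the
// btree index first; only big result sets trigger on-network searches.
public class HybridSearch {

    /** What the btree stores for one keyword: either the matches
     *  themselves, or (for big result sets) the hashes of the
     *  sub-keyword segments. */
    public interface BtreeEntry {
        boolean isInline();                // small: matches stored directly
        List<String> inlineMatches();      // the documents, if inline
        List<byte[]> subKeywordHashes();   // segment locations, if spilled
    }

    public interface BtreeIndex {
        BtreeEntry lookup(String keyword); // assumed client-side btree fetch
    }

    public interface NetworkSearch {
        // Route a search to the location of one sub-keyword hash, carrying
        // the compressed list of trusted inserter identities (see above).
        List<String> search(byte[] subKeywordHash);
    }

    private final BtreeIndex btree;
    private final NetworkSearch network;

    public HybridSearch(BtreeIndex btree, NetworkSearch network) {
        this.btree = btree;
        this.network = network;
    }

    public List<String> query(String keyword) {
        BtreeEntry entry = btree.lookup(keyword);
        if (entry.isInline()) {
            // Unpopular keyword: the btree already holds the full list.
            return entry.inlineMatches();
        }
        // Popular keyword: fetch each manageable-sized segment on-network
        // instead of downloading and intersecting a huge index file.
        List<String> results = new ArrayList<>();
        for (byte[] hash : entry.subKeywordHashes())
            results.addAll(network.search(hash));
        return results;
    }
}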
For the time being, btree indexes and WoT are the way to go. There is much
that can be done to improve on them. On-network intersection is something to
consider much later on, e.g. post-1.0. Hence my posting it to tech at .
