FASD paper: http://freenetproject.org/papers/kronfol_final_thesis.pdf
The scheme in the FASD paper enables keyword searching, but it is spammable. Basically it is really easy (I haven't read it in a while, so my terminology may be completely different to his):

- Keywords are hashed, of course, and these hashes translate to routing locations.
- An index element has a list of keywords connected to it.
- These are inserted as normal keys, except that they are inserted to the locations of *each* of the keywords.
- A search can be routed to all of the keyword locations, or to some subset of them. It will include a list of keywords.
- The receiving nodes will send back any index elements that match the search.
- Standard relevance algorithms are used to rank results (based on relative frequency etc). This can easily be extended to arbitrary insert-time relevance numbers.
- I don't recall whether flow control was specified; it could become a problem, but the obvious solution is just to limit the number of returned items.

(A minimal sketch of this insert/search flow is below.)

The fundamental problem has always been spam: these are effectively KSKs, and when you fetch them you propagate all the spam that the spammer has inserted. So solve it the same way we solve this for chat. All we have to do to make FASD feasible is:

- Make the items signed.
- Send a compressed list of the inserter identities we are interested in along with the request (see the Bloom filter sketch below).

Clearly scalability is limited in much the same way as it is for chat. However, it should be workable provided there is a certain amount of aggregation and centralisation at the end-user level. And it's a big improvement on having to download several huge indexes and intersect them on the client. Arguably it may be more efficient to use this mechanism only for popular keywords; for unpopular keywords the btree indexes should be very efficient. And it's probably a useful primitive for many applications...

Could it be made fully distributed, without needing a WoT? Yes, but only with:

1) Some sort of reputation feedback mechanism. This most likely involves not propagating, or de-propagating, content that was fetched and not wanted, and
2) A sufficiently high cost for inserting keywords, to deter spamming. E.g. something similar to the Scarce KSKs proposal - limit the number of keyword index items that can be inserted to one per X seconds per link.

These are in principle feasible, but they may open serious attacks. E.g. the first may introduce censorship attacks, and it links to the wider question of whether it is ever safe to let a node discover the location of a document without propagating it. IMHO the main issue there is path folding, and we can turn path folding off for keyword searches (just as we turn it off for SSKs and inserts). Plus, will users give sufficient useful feedback?

KISS: the simplest mechanism: keep an LRU (random replacement is better for data storage, but not here). When a fetch happens, generate an unforgeable token for each returned datum. If the user gives positive feedback, the token is sent back to us, we "spend" it (i.e. commit it to persistent storage so it cannot be spent twice, like digicash), and promote the item. This would be an operation similar to an insert, which would promote on those nodes where it matched the original fetch. Obviously it will only work for a relatively short period after the original request. We might keep recently fetched items in a special cache, to be promoted into the main cache if feedback comes in, and dropped if it doesn't. To prevent abuse, we could perhaps use rate limiting similar to that proposed for inserts? (A sketch of the token scheme is below as well.)
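To make the insert/search flow above concrete, here is a minimal, self-contained sketch in Python. The toy routing layer is just a dict from hashed keyword locations to stored elements; all class and function names are mine, not the paper's, and real FASD routes over the network rather than a local dict.

# Toy model of the FASD-style keyword index described above.
import hashlib
from collections import defaultdict

def keyword_location(keyword):
    """Hash a keyword to a routing location (here just a hex digest)."""
    return hashlib.sha256(keyword.lower().encode()).hexdigest()

class IndexElement:
    def __init__(self, uri, keywords, relevance=1.0):
        self.uri = uri                         # document the element points to
        self.keywords = {k.lower() for k in keywords}
        self.relevance = relevance             # arbitrary insert-time relevance

class ToyNetwork:
    def __init__(self):
        self.store = defaultdict(list)         # location -> index elements

    def insert(self, element):
        # Inserted as a normal key, but to the location of *each* keyword.
        for kw in element.keywords:
            self.store[keyword_location(kw)].append(element)

    def search(self, keywords, limit=20):
        # Route to every keyword location; receiving nodes return matches,
        # ranked by keyword overlap weighted by insert-time relevance.
        # `limit` is the crude flow control mentioned above.
        query = {k.lower() for k in keywords}
        best = {}                              # uri -> (score, element)
        for kw in query:
            for el in self.store[keyword_location(kw)]:
                score = len(el.keywords & query) * el.relevance
                if el.uri not in best or score > best[el.uri][0]:
                    best[el.uri] = (score, el)
        ranked = sorted(best.values(), key=lambda se: -se[0])
        return [el for _, el in ranked[:limit]]

# Usage:
#   net = ToyNetwork()
#   net.insert(IndexElement("CHK@...", ["freenet", "keyword", "search"]))
#   hits = net.search(["keyword", "search"])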
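The "signed items plus compressed identity list" refinement could look something like the sketch below. I'm assuming each index element gains an inserter_id field (bytes) whose signature the node has already verified elsewhere, and using a small Bloom filter as the compressed list; none of this is WoT's real API or a wire format, and the sizes and hash counts are made up.

# Hedged sketch: compressed trusted-identity list as a Bloom filter.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, n_hashes=4):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Requester side: build the filter from identities we trust and send its
# `bits` along with the search request (identities here are made up).
trusted = BloomFilter()
for identity in (b"id:alice", b"id:bob"):
    trusted.add(identity)

def filter_results(elements, trusted):
    # Receiving-node side: only return index elements whose (verified)
    # inserter is probably in the requester's trusted set. Bloom false
    # positives only let the odd spam item through; they never drop a
    # wanted result.
    return [el for el in elements if el.inserter_id in trusted]

A Bloom filter is a natural fit here because the trusted set only needs approximate membership, and a kilobyte or so piggybacks cheaply on the request.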
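The spend-once token could be as simple as an HMAC under a node-local secret. A sketch, with made-up names and an in-memory set standing in for the digicash-style persistent spent store; a real node would also embed a timestamp so tokens expire shortly after the original request.

# Sketch of the spend-once feedback token: an HMAC over (nonce, datum id)
# keyed with a node-local secret.
import hashlib, hmac, os

NODE_SECRET = os.urandom(32)      # node-local key, never leaves the node
spent = set()                     # persisted in a real node, like digicash

def issue_token(datum_id):
    """Issued alongside each datum returned for a fetch."""
    nonce = os.urandom(8)
    mac = hmac.new(NODE_SECRET, nonce + datum_id.encode(),
                   hashlib.sha256).digest()
    return nonce + mac

def spend_token(datum_id, token):
    """True exactly once per valid token; the caller then promotes the
    item (e.g. out of the special recently-fetched cache)."""
    nonce, mac = token[:8], token[8:]
    expected = hmac.new(NODE_SECRET, nonce + datum_id.encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return False              # forged or corrupted token
    if token in spent:
        return False              # double spend attempt
    spent.add(token)
    return True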
Could a spammer keep stuff in the cache indefinitely without anyone else fetching it? Possibly, by repeatedly requesting it (he can't repeatedly insert) - but not if we decline to promote items that we have already returned in response to a request from a node close to the originator. Or we just impose rate limiting on search requests as well as inserts (the same per-link throttle sketched below would do). Or ignore it - spam is only a problem if it is in volume, and inserts are throttled anyway.

Hmmm, what exactly does FASD say about this? My recollection was that FASD was spammable. I should really read the paper again! It looks like it *might* be a tractable problem...

And can these mechanisms prevent spam on chat too?? Yes, chat is essentially a search problem.
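For completeness, the per-link limit ("one per X seconds per link") is just a token bucket kept per peer connection, usable for both keyword-index inserts and search requests. A sketch; X and the burst allowance are invented for illustration.

# Per-link token bucket: at most ~one keyword-index insert (or search)
# per `interval_secs` per link, with a small burst allowance.
import time

class LinkThrottle:
    def __init__(self, interval_secs=30.0, burst=3):
        self.rate = 1.0 / interval_secs     # tokens regained per second
        self.burst = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                        # reject (or queue) the message

# Usage: keep one LinkThrottle per peer link and call allow() before
# accepting each keyword-index insert or search from that link.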
