This message is quite long. Just to warn you.

---------------------------------

Why A Broadcast-based Search Method Will Work On Freenet
Author: Brandon Wiley (blanu at uts.cc.utexas.edu)

This is a proposal for a searching mechanism for Freenet. The difference
between this method and others is that it does not attempt to use normal
Freenet message routing (based on a closeness metric for the key) to
optimize searches. Instead, it adjusts the mechanisms of Freenet
(caching, and routing to where a probable hit will be) to work with
searches. Searches are fundamentally different from requests.

The important thing to remember about broadcast searches is that, in the
worst case, even smart searching methods (and data requests) turn into a
broadcast search. However, in the best case an intelligently routed
message will take a lot less time and produce a lot less load on the
network than a search of the entire network. This proposal specifies how
to make searches have this same property, without any tricks to route them
based on a key-based closeness metric. The following points will be
covered:

Problems:
  Scalability
  Flooding Attacks
  Dumb Routing

Benefits:
  Unlimited potential for search types
    Regular expression matches
    Keywords
    Structured metadata queries
  Simplicity, no tricks
  Definitely works


Scalability:
        It has been said that a broadcast-based search method is not
scalable. This is in reference to the exponential growth of the number of
requests as the search request propagates. This means that a message with
a large HTL would go to many more nodes than a DataRequest with the same
HTL. This, however, is not a bad thing. The value of a searching mechanism
should be evaluated according to its accuracy, thoroughness, speed, and
load on the network. Comparing a BSR (Broadcast Search Request) with an
SSR (Smart Search Request), the accuracy and thoroughness of an SSR are
determined by the routing mechanism (and therefore, probably, by the
closeness metric). The accuracy and thoroughness of a BSR are determined
entirely by its HTL. These values can therefore be increased by the user,
embracing our philosophy of letting the user decide when it comes to
issues of tradeoffs. The tradeoff, in this case, is accuracy and
thoroughness vs. speed: because of the exponential growth of messages, a
BSR creates more load on the network than an SSR, which has message
growth of O(1). While this may seem like a clear advantage for SSRs, it's
not really that big of a deal, and the advantages of BSRs are great. The
reason it's not a big deal is that BSRs only have to increase the load
exponentially if they want to be executed in constant time. The solution
to exponential system load is to treat search requests as less important
than smart messages such as DataRequests.

There are two ways to deal with this problem.

1) Prioritize BSRs by their HTL, giving lower priority to higher HTLs and
giving BSRs lower priority than smart messages. Also, keep them in their
own queue so that they don't compete for message memory space with other
messages. This means that messages with large HTLs (high network
load) will be delayed while other messages are processed, effectively
reducing the load to that of other messages. Also, if they have their own
queue, a flood of broadcast messages with high HTLs, intended to DOS the
network, will just knock each other off the queue. This gives the user
the choice of a tradeoff between thoroughness and speed. Small HTLs will
finish quickly. Large HTLs will give more results. The system load will be
constant. The accuracy of the search will be complete within a given
number of hops (that is, all matches within a given number of hops will be
reported).
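Method 1 can be sketched as a small priority queue. This is a minimal
illustration, not part of the proposal's wire protocol; the class and
method names are made up, and the capacity bound stands in for "message
memory space".

```python
import heapq

class BSRQueue:
    """Separate, bounded queue for broadcast search requests (BSRs).

    Lower HTL = higher priority, so cheap (shallow) searches run first
    and expensive (deep) ones are delayed, keeping the effective
    network load roughly constant. BSRs never compete with the normal
    message queue because they live in their own structure.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []   # entries are (htl, seq, request)
        self._seq = 0     # tie-breaker keeps insertion order stable

    def offer(self, htl, request):
        heapq.heappush(self._heap, (htl, self._seq, request))
        self._seq += 1
        # A flood of high-HTL BSRs just evicts other high-HTL BSRs,
        # so the flood knocks itself off the queue.
        while len(self._heap) > self.capacity:
            self._drop_deepest()

    def _drop_deepest(self):
        deepest = max(range(len(self._heap)), key=lambda i: self._heap[i][0])
        self._heap[deepest] = self._heap[-1]
        self._heap.pop()
        heapq.heapify(self._heap)

    def poll(self):
        """Return the (htl, request) pair with the smallest HTL."""
        htl, _, request = heapq.heappop(self._heap)
        return htl, request
```

Under this sketch a high-HTL flood only displaces other high-HTL
entries, while shallow searches keep draining from the front.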

2) Instead of broadcasting to all known nodes, broadcast to a random node.
This will give a small slice of results. If the user wants a more
thorough search, he can resubmit. Nodes should remember past BSR paths
and be able to rule out nodes which the BSR has previously been sent
to. A node should forward the message as it previously did until the node
it is sending to sends back a RequestFailed (out of alternative
references), in which case it should mark that node as exhausted and send
to other nodes in the future. This is basically just an exhaustive search
of the part of the network reachable within a certain HTL. It is therefore
a broadcast-based search. However, it is broken down into one request per
path. This means that no new mechanism need be added for dealing with
floods. Each BSR has O(1) message growth. Whatever mechanisms for
preventing flooding are put into normal messages will work, unchanged, for
BSRs. The disadvantage of this method compared to method 1 is that it will
be slower, taking many more messages to do an exhaustive search. One
slight enhancement that would probably be beneficial is to have a BSR
broadcast to all known nodes when the HTL is 1. This would greatly reduce
the overall number of messages sent while not allowing full exponential
growth of the number of messages sent per BSR.

What this method does, effectively, is allow the user to control the
thoroughness of his search by resubmitting his search until he has enough
matches to be satisfied. This can, of course, be automated by the
client. Also, no new mechanism need be added to deal specifically with
BSR-based floods.
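The per-node bookkeeping that method 2 needs can be sketched as follows.
This is an illustration under assumptions: the class and method names are
invented, searches are identified by an opaque ID, and "RequestFailed" is
modeled as a plain callback.

```python
import random

class BSRForwarder:
    """Forwards each BSR to one neighbour at a time (O(1) growth).

    Per search ID, remembers which neighbour the search went to and
    keeps reusing it until that neighbour reports RequestFailed (out
    of alternative references); it is then marked exhausted and a
    fresh neighbour is picked. Repeated resubmissions thus walk an
    exhaustive search of the reachable area, one path per request.
    """

    def __init__(self, neighbours):
        self.neighbours = list(neighbours)
        self.current = {}     # search_id -> neighbour currently in use
        self.exhausted = {}   # search_id -> set of spent neighbours

    def next_hop(self, search_id):
        if search_id in self.current:
            return self.current[search_id]      # same path as before
        spent = self.exhausted.setdefault(search_id, set())
        candidates = [n for n in self.neighbours if n not in spent]
        if not candidates:
            return None                         # reachable area exhausted
        choice = random.choice(candidates)
        self.current[search_id] = choice
        return choice

    def on_request_failed(self, search_id):
        """Neighbour ran out of alternative references for this search."""
        node = self.current.pop(search_id, None)
        if node is not None:
            self.exhausted.setdefault(search_id, set()).add(node)
```

When `next_hop` returns None, every path reachable from this node has
been tried, which is exactly the "exhaustive search, one request per
path" behaviour described above.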


Flooding Attacks:
  The problem with flooding is that it produces enough load on the system
to keep normal requests from getting through. With either of the two
solutions above, which distribute search load over time, the potential for
flooding attacks is reduced to that of normal messages.


Dumb Routing:
  All of this load distribution comes at a price, that of speed. A BSR
will be slower than an SSR because, without smart routing, the finding of
information will not be optimized by routing. However, this is okay, for
two reasons. First, although an SSR would have optimized routing, it
probably wouldn't have all of the other benefits of a BSR (covered in the
Benefits section). So optimized routing doesn't make an SSR better, it's
just one advantage that it has while it lacks many others. True, it is not
the Freenet way to use brute force searching; it is the Freenet way to use
optimized routing for everything. However, all of the abilities that BSRs
give to Freenet searches are very important, perhaps even indispensable.
Also, BSRs _can_ be optimized, just not with routing. The way to make BSRs
fast is to make them shallow. With either of the above load distributing
methods, shallow BSRs will be significantly faster than deep ones. The way
this can be achieved is with caching. Just as Freenet caches data
that is requested, it should cache search results that match queries. This
way, popularly searched for keys will be cached, reducing the depth of
searches. Popular data should be not only fast to retrieve, but also fast
to find. The size of these caches may seem problematic. With a really
large Freenet, there will be a great many keys. The caches would have to
be huge. However, this is not necessarily true. A cache only remembers
popular keys. Searches that match popular keys will return very
quickly. Searches that get few matches will not find matches in the cache
and fall back to a normal BSR as previously outlined. However, when the
search reply is received, all of the keys it returns will be added to the
cache. A future search which matches these keys will return instantly.
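The cache behaviour described above can be sketched as a small LRU
structure. The class name and the idea of keying the cache on the query
string are assumptions for illustration; the proposal does not fix a
cache layout.

```python
from collections import OrderedDict

class SearchResultCache:
    """LRU cache mapping a query to the keys that matched it.

    Popular queries answer instantly from the cache; rare queries miss
    and fall through to a normal BSR, and the reply then seeds the
    cache so the next matching search returns immediately.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()   # query -> set of matching keys

    def lookup(self, query):
        keys = self._entries.get(query)
        if keys is not None:
            self._entries.move_to_end(query)   # mark as recently used
        return keys

    def store(self, query, keys):
        """Called when a BSR reply comes back with matching keys."""
        self._entries[query] = set(keys)
        self._entries.move_to_end(query)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
```

Because eviction is least-recently-used, the cache naturally keeps only
the popular keys, which is all the text above requires of it.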

What is happening here is that popular keys are returned quickly while
unpopular keys take longer to be found. This is the same effect that
Freenet has on data, even though the mechanisms are different.

Another thing about caching to consider is that a cache absolutely does
not need to have a list of all keys in Freenet. Requests only search the
area within a certain HTL anyway. What is desired is not a comprehensive
list, but a list of everything which is reasonably available, which is
significantly smaller.

Also, the key list will be significantly smaller than the size of the data
being indexed. If every node designated 10% of its datastore space to
key lists, and an entry is 1% the size of the data it represents (probably
a lot less in many cases and more in some), then a node can contain a
cached key list covering roughly ten nodes' worth of information
(0.1 / 0.01 = 10). So the number of hops before results would be found for
a BSR shouldn't be that big.
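As a quick check of that arithmetic (the two ratios are the text's
assumptions, not measurements):

```python
# Cache-coverage arithmetic, normalising one node's datastore to 1.0.
datastore = 1.0
keylist_fraction = 0.10   # 10% of the store given over to key lists
entry_ratio = 0.01        # an entry is 1% the size of the data it indexes

# Data indexable by the key-list space: 0.10 / 0.01 = 10 datastores.
indexed_data = datastore * keylist_fraction / entry_ratio
print(indexed_data)       # 10.0 -> about ten nodes' worth of data
```

The coverage scales linearly with either ratio, so a node that devotes
more space to key lists, or uses smaller entries, indexes proportionally
more of the network.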

Of course, it would be quite possible for individuals to take it upon
themselves to run nodes with really large datastores dedicated to holding
mostly keys. A gig of keys is a whole lot of data and a single node could
hold a gig of keys.

This leads to another BSR optimization. BSR routing can be optimized
based on which nodes give back the most search results. This way, a node
dedicated to serving keys would be automatically detected and given high
priority for BSRs. This is very similar to normal DataRequest
routing: send the request to where it is likely to find what it is looking
for. In this case, that means keys matching certain criteria. If that
node fails, it falls back to a normal brute force search again, of
course. However, that is what happens with normal requests. If you request
something that is not in the network, the equivalent of a broadcast data
request will occur.
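Routing by historical result counts is simple bookkeeping; a minimal
sketch (names invented for illustration) might look like this:

```python
class ResultCountRouter:
    """Ranks neighbours by how many search results each has returned.

    A node dedicated to serving keys naturally accumulates a high
    result count and is tried first; if it fails, the search falls
    back through the remaining neighbours, which in the worst case is
    the brute-force search described earlier.
    """

    def __init__(self, neighbours):
        self.results_seen = {n: 0 for n in neighbours}

    def record_results(self, neighbour, count):
        """Called when a neighbour's search reply carries `count` matches."""
        self.results_seen[neighbour] = self.results_seen.get(neighbour, 0) + count

    def ranked(self):
        """Neighbours in the order a BSR should try them."""
        return sorted(self.results_seen, key=self.results_seen.get, reverse=True)
```

This mirrors DataRequest routing in spirit: send the message where it is
most likely to succeed, but measure "likely" by past result volume
rather than by key closeness.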

Though the mechanism for optimization (KHK closeness metric vs. historical
reliability) is different, that is because searches and requests are by
nature fundamentally different enterprises. The first is seeking to
accumulate a number of matches while the second is seeking a particular
item. The effect is the same. Caching and routing based on probability of
success are used to increase the efficiency of what is, in the worst case,
a brute force search of the network.


Benefits:

Unlimited Potential for Search Types:
  Unlike other proposals (for SSRs), BSRs allow very flexible
searching. The routing and the type of searching are not
connected. Caching and routing optimized by the number of search results
which nodes have supplied in the past allows for different kinds of
searches to be used. Keys can be matched against regular
expressions. Keywords can be indexed in a keyword list. Most importantly
(and unlike any other system which is similar to Freenet) we can have
structured metadata queries. This is a useful feature. Any search method
that we employ should allow all of the following searches to be made (given
that the users and clients cooperate, of course):

Documents about Freenet
mp3s by the band Phish
Newer than 3:00 April 10, 1999
Signed by blanu at uts.cc.utexas.edu
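The four example searches above could each be expressed as a structured
predicate over metadata. This sketch assumes a hypothetical metadata
record with `type`, `keywords`, `artist`, `date`, and `signer` fields;
the proposal does not fix a schema, so the field names are illustrative
only.

```python
from datetime import datetime

# One hypothetical document's metadata record.
doc = {
    "type": "document",
    "keywords": ["Freenet", "routing"],
    "artist": None,
    "date": datetime(1999, 5, 1),
    "signer": "blanu at uts.cc.utexas.edu",
}

# Each example search from the list above, as a structured query:
queries = {
    "documents about Freenet":
        lambda d: d["type"] == "document" and "Freenet" in d["keywords"],
    "mp3s by Phish":
        lambda d: d["type"] == "mp3" and d["artist"] == "Phish",
    "newer than 3:00 April 10, 1999":
        lambda d: d["date"] > datetime(1999, 4, 10, 3, 0),
    "signed by blanu":
        lambda d: d["signer"] == "blanu at uts.cc.utexas.edu",
}

hits = [name for name, match in queries.items() if match(doc)]
print(hits)
```

The point is that the broadcast mechanism never inspects the query; any
predicate a node can evaluate locally (regex, keyword, metadata) rides
the same search framework.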


Simplicity, no tricks:
  This method is straightforward to implement, both in code and in the
protocol. It doesn't require any kind of tricks or compromises to make
things work. All kinds of searches are possible under a single search
framework. No hashing or mangling has to be done.


Definitely works:
  We know this method works. Gnutella uses an unoptimized broadcast model
and Gnutella definitely works. It has thousands and thousands of hosts and
everything works fine. Of course, our aim is to be better than
Gnutella. That's why we should use optimized broadcasting with caching,
dedicated caching nodes (run at the whim of node owners, of course, and in
no way centralized), and routing of BSRs based on the reliability of
nodes. And of course, we will have structured metadata queries, which puts
us above the level of a file sharing program, onto the level of the web,
or even above.




_______________________________________________
Freenet-dev mailing list
Freenet-dev at lists.sourceforge.net
http://lists.sourceforge.net/mailman/listinfo/freenet-dev
