Your proposed network is really just a tree, from which you're
picking some nearby nodes to use to distribute queries to the nodes below them.
In practice, it's not substantially different from querying the root node.
(In fact it's about 99% identical.)
You're failing to recognize the hierarchy, but it's still there.
And since you're still proposing to forward queries to everybody in the tree,
whether directly or less directly, "your message may cost the net
hundreds if not thousands of dollars" to distribute,
and you'll have performance problems because you're constrained by the
slowest and least reliable links in the networks.

Supernode networks build hierarchies to avoid these problems.
A supernode connects to N regular nodes and summarizes their index information,
and normally there's some discovery mechanism to find a good nearby 
supernode, plus a
mechanism to make sure that supernodes have better bandwidth than leaf nodes.
Assuming that a query contains about as much information as a database 
index item*,
if the number of queries a leaf node makes exceeds the number of items in 
its database,
it wins by letting a supernode handle queries to its index.

More importantly, instead of the network carrying N queries on slow links,
it only needs to send one query to the supernode on a probably-faster 
connection,
reducing the delay time required for a successful or unsuccessful response.
In networks where the total bandwidth of the N leaf nodes exceeds the
capacity of the supernode (e.g. 100 56kb modems vs. a 1.5Mbps T1 or 384k DSL),
the effective bandwidth of the leaf nodes is even slower,
so the impact on delay time is even larger.  Since the index uploads
don't all happen simultaneously or on the critical path for queries,
unlike downward search requests, this oversubscription isn't a problem.

If the supernode summarizes duplicate responses intelligently,
it can also significantly reduce the traffic volume from popular items,
since it doesn't have to report *every* copy of that Metallica song.

There are still scale questions for supernode networks -
How big is N?  How many items does an average leaf node store?
How many index items can a supernode reasonably store?
How fast is a query - you don't want to thrash the supernode?
How fast do leaf nodes come and go on a given supernode?
How fast do index items change on a given leaf or supernode?
Should a leaf node index itself on multiple supernodes for reliability?
Should there also be super-supernodes to summarize the supernodes?
How do you set up a good balanced discovery process for supernodes?

A typical setup might have 100-byte index entries, 1000 items/node,
and 100 nodes/supernode giving about 10MB of index if all the items are unique,
or maybe 2-5MB if lots of items are duplicates and lots of names are short.
1000 nodes/supernode is 20-50-100MB, which is still not bad on a 5GB iPod.



----
* This isn't always strictly true, but it's pretty close.
Modem upstream is slower than modem downstream, but responses to
successful queries have to be transmitted upstream as well.

Reply via email to