Between evanbd, infinity0 and me, we have discovered that we can 
*dramatically* speed up searching. Here's the deal:
- The top of an index is an SSK block. Since we subscribe to the index, this 
will get prefetched.
- Each level of the btree contains the metadata for the layer below. But this 
is distributed into nodes in such a way that you don't need to fetch the whole 
layer, just the node.
- At each level of a btree-based index, we have a block-level fan-out of 
approximately 512. This will be less in practice, because e.g. we are using 
b-trees not b+trees (eventually infinity0 will fix this), and because of the 
(small) space needed for the keys between the nodes, but not dramatically less. 
Let's say 128.
- So the first layer is an SSK, which is always prefetched. The second layer is 
a single splitfile of 128K-512K. We should always prefetch this too.
- The third layer is something between 16MB and 256MB. This should be enough 
for many indexes. Four layers should be enough for ANY index, at 2GB to 128GB.
- We can make a node be a single block, 32KB. Nodes are aggregated into 
splitfiles of say 16+16 blocks, but in the average case for a popular index we 
would fetch the specific node we want, plus a randomly chosen check block in 
that splitfile, plus a randomly chosen data or check block from within the same 
node.
- The average case for a popular index, at least in the third layer, is then 
that the block is fetched the first time. The other two fetches are just for 
propagation and are not on the critical path.
- So even for a big index, a fetch could reasonably take *TWO DEPENDENT CHK 
FETCHES* on average (and two 16+16 splitfile fetches in the worst case). I.e. 
it could be *fast*.
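The layer arithmetic above can be sanity-checked with a few lines of Python. 
One caveat: the 4-to-16 node count for the second layer is my inference worked 
backwards from the stated 128K-512K splitfile size, not a figure given directly 
above.

```python
KiB = 1024
NODE = 32 * KiB  # a node is a single 32 KiB block

# (conservative, ideal) pairs: per-node fan-out ~128 in practice, up to ~512
node_fanout = (128, 512)
# Second layer is a single splitfile of 128 KiB .. 512 KiB, i.e. 4 .. 16 nodes
# (inferred from the stated sizes, see the caveat above).
layer2_nodes = (4, 16)

layer2_bytes = tuple(n * NODE for n in layer2_nodes)
layer3_bytes = tuple(n * f * NODE for n, f in zip(layer2_nodes, node_fanout))
layer4_bytes = tuple(n * f * f * NODE for n, f in zip(layer2_nodes, node_fanout))

print(layer2_bytes)  # (131072, 524288)           i.e. 128 KiB .. 512 KiB
print(layer3_bytes)  # (16777216, 268435456)      i.e. 16 MiB .. 256 MiB
print(layer4_bytes)  # (2147483648, 137438953472) i.e. 2 GiB .. 128 GiB
```

These reproduce the ranges quoted above: 16MB-256MB of leaf capacity at three 
layers, 2GB-128GB at four.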

Once we reach the bottom of the btree, we have more work to do. Depending on 
how popular the data being searched for is, we may have to fetch one or more 
blocks for its index. If we are only fetching a single word, we don't need to 
download the full data, although there are redundancy issues here. If we are 
fetching multiple words we probably don't need the full data for both of them 
either. And the spider should group commonly adjacent words in its search data 
if possible - e.g. the phrase [ unpopular word ] [ stop word ] should be 
indexed as a unit.
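To illustrate why that grouping pays off, here is a toy sketch (the word lists 
and the `phrase_hits` helper are invented for illustration, not part of the 
spider): if the spider has pre-indexed the bigram, a phrase query never has to 
download the stop word's enormous posting list at all.

```python
# Posting lists as plain sets of document ids; in the real index these would
# be b-tree leaves fetched as CHK blocks.
postings = {
    "ephemeris": {12, 97},            # unpopular word: tiny posting list
    "the": set(range(100_000)),       # stop word: huge posting list
}
# Bigrams pre-grouped by the spider at index time (hypothetical).
bigrams = {
    ("ephemeris", "the"): {12},
}

def phrase_hits(w1, w2):
    """Prefer a pre-indexed bigram; fall back to intersecting posting lists."""
    if (w1, w2) in bigrams:
        return bigrams[(w1, w2)]      # one small fetch, no stop-word download
    return postings[w1] & postings[w2]

print(phrase_hits("ephemeris", "the"))  # {12}
```

The fallback intersection still works, but it forces the full stop-word data 
to be fetched, which is exactly the cost the grouping avoids.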

Obviously none of this is possible without infinity0's work on on-Freenet 
b-trees last summer.