Between me, evanbd and infinity0, we have discovered that we can *dramatically* speed up searching. Here's the deal:

- The top of an index is an SSK block. Since we subscribe to the index, this will get prefetched.
- Each level of the btree contains the metadata for the layer below. But this is distributed across nodes in such a way that you don't need to fetch the whole layer, just the node you want.
- At each level of a btree-based index, we have a block-level fan-out of approximately 512. In practice it will be less, because e.g. we are using b-trees rather than b+trees (eventually infinity0 will fix this), and because of the (small) space needed for the keys between the nodes, but not dramatically less. Let's say 128.
- So the first layer is an SSK, which is always prefetched. The second layer is a single splitfile of 128K-512K. We should always prefetch this too.
- The third layer is somewhere between 16MB and 256MB. This should be enough for many indexes. Four layers should be enough for ANY index, at 2GB to 128GB.
- We can make a node a single block, 32KB. Nodes are aggregated into splitfiles of, say, 16+16 blocks, but in the average case for a popular index we would fetch the specific node we want, plus a randomly chosen check block in that splitfile, plus a randomly chosen data or check block from within the same node.
- The average case for a popular index, at least in the third layer, is then that the block is fetched on the first attempt. The other two fetches are just for propagation and are not on the critical path.
- So even for a big index, a search could reasonably take *TWO DEPENDENT CHK FETCHES* on average (and two 16+16 splitfile fetches in the worst case). I.e. it could be *fast*.
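To sanity-check the layer sizes quoted above, here is a quick back-of-the-envelope sketch. The 32KB block size, the 128K-512K second layer, and the 128-512 fan-out range come straight from the figures in this message; nothing here is measured.

```python
# Back-of-the-envelope check of the btree layer sizes: starting from the
# stated second-layer splitfile size, each deeper layer is `fanout`
# times larger, since every 32 KiB block points to `fanout` children.

BLOCK = 32 * 1024  # one CHK block

for fanout, layer2 in ((128, 128 * 1024), (512, 512 * 1024)):
    layer3 = layer2 * fanout   # every layer-2 block fans out to `fanout` nodes
    layer4 = layer3 * fanout
    print(f"fan-out {fanout}: layer 3 = {layer3 // 2**20} MiB, "
          f"layer 4 = {layer4 // 2**30} GiB")
```

With the pessimistic fan-out of 128 this gives a 16 MiB third layer and a 2 GiB fourth layer; with the ideal 512 it gives 256 MiB and 128 GiB, matching the ranges above.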
Once we reach the bottom of the btree, we have more work to do. Depending on how popular the data being searched for is, we may have to fetch one or more blocks for its index entry. If we are only fetching a single word, we don't need to download the full data, although there are redundancy issues here. If we are fetching multiple words, we probably don't need the full data for both of them either. And the spider should group commonly adjacent words in its search data where possible - e.g. [ unpopular word ] [ stop word ] should be indexed together.

Obviously none of this would be possible without infinity0's work on on-freenet btrees last summer.
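To make "one dependent CHK fetch per level" concrete, here is a minimal, hypothetical sketch of the lookup path. `lookup`, `pick_child`, the node dict format and the in-memory block store are all illustrative stand-ins, not the actual on-freenet btree code.

```python
def lookup(term, root, fetch_block):
    """Walk from the (prefetched) root down to a leaf, fetching exactly
    one 32KB node per level instead of downloading whole layers."""
    node = root  # layers 1-2 are assumed already local via prefetch
    while not node["leaf"]:
        child_key = pick_child(node, term)
        node = fetch_block(child_key)  # one dependent CHK fetch per level
    return node["entries"].get(term)

def pick_child(node, term):
    # node["children"] is a sorted list of (separator_key, child_pointer);
    # a real node would binary-search this.
    for sep, child in node["children"]:
        if term <= sep:
            return child
    return node["children"][-1][1]

# Toy in-memory "index": blocks addressed by key, standing in for what
# fetch_block would return from the network.
blocks = {
    "leafA": {"leaf": True, "entries": {"apple": [1, 2], "cat": [3]}},
    "leafB": {"leaf": True, "entries": {"dog": [4], "zebra": [5]}},
}
root = {"leaf": False, "children": [("cat", "leafA"), ("zzzz", "leafB")]}

print(lookup("dog", root, blocks.get))  # [4]
```

The point of the sketch is simply that the descent touches one node per level, so with layers 1 and 2 prefetched, only the third and fourth layers cost network round trips.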
