On Thu, 2009-06-11 at 21:25 +0800, Daniel Cheng wrote:
> On 11/6/2009 20:16, Mike Bush wrote:
> > 2009/6/10 Daniel Cheng <j16sdiz+freenet at gmail.com>:
> [...]
> >>
> >> This is yet another reason to split the <site> part out.
> >
> > I've built 2 indexes to find the space saving from separating keys
> > from words as well,
> > for an index > 16000 keys with 256 subindices:
> >
> > The normal index with keys integrated in files > 400MB
> > With keys in a separate key index (3MB) it totals 160MB
> >
> > Of course the difference wouldn't be so large if the index wasn't
> > separated into so many pieces.
> >
> > One thing I worried about was that the file index would get very
> > large, but even for the key index to be bigger than one of wanna's
> > subindexes it would contain > 320000 keys. How many keys do very large
> > indexes have?
>
> For a starter idea,
> try to split the <site> into multiple files..
>
> site_XXXX.xml
> where
> XXXX is the prefix of MD5( SSK@/CHK@ of the site )
>
> take the MD5 of the key, but _NOT THE DOC PATH_.
> This would have the following advantages:
>
> - the file would compress better
>
> - USK@ editions would be grouped together
>   * USK edition based magic is easier.
>   * Words across multiple editions would look similar,
>     grouping means fewer site files to fetch
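To make sure I follow the proposal, here is roughly what the bucket selection would look like. This is just a sketch: the four-hex-digit prefix length, the class/method names and the assumption that the doc path has already been stripped from the key are mine, not part of your suggestion.

    import java.nio.charset.Charset;
    import java.security.MessageDigest;

    public class SiteBucket {
        // Sketch: pick the subindex file a site key would land in.
        // Assumes XXXX = first four hex digits of MD5, and that siteKey
        // is the bare SSK@/CHK@ with the document path already removed.
        static String subindexFileFor(String siteKey) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] d = md5.digest(siteKey.getBytes(Charset.forName("UTF-8")));
            String prefix = String.format("%02x%02x", d[0] & 0xff, d[1] & 0xff);
            return "site_" + prefix + ".xml";
        }
    }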
I would imagine that splitting the site index could be futile, though. If it were split into only a few files, say 16, a typical search result of many hundreds of hits would still require fetching most of the parts. On the other hand, a large number of splits would mean only a small proportion of the files need to be requested, but the sheer number of individual requests would slow things down further.
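A back-of-the-envelope way to see that trade-off (assuming MD5 spreads the result keys evenly over the subindex files; the method name is mine):

    // Expected number of distinct subindex files touched by R result keys
    // spread evenly over S files: S * (1 - (1 - 1/S)^R)
    static double expectedFilesTouched(int splits, int results) {
        return splits * (1.0 - Math.pow(1.0 - 1.0 / splits, results));
    }

For 300 results that works out to essentially all 16 of 16 files, versus roughly 177 of 256 files, so more splits do cut the proportion fetched, just at the cost of far more individual requests.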
