On 11/6/2009 20:16, Mike Bush wrote:
> 2009/6/10 Daniel Cheng<j16sdiz+freenet at gmail.com>:
[...]
>>
>> This is yet another reason to split the <site> part out.
>
> I've built 2 indexes to find the space saving from separating keys
> from words as well,
> for an index > 16000 keys with 256 subindices:
>
> The normal index with keys integrated in files > 400MB
> With keys in a separate key index (3MB) it totals 160MB
>
> Of course the difference wouldn't be so large if the index wasn't
> separated into so many pieces.
>
> One thing I worried about was that the file index would get very
> large, but even for the key index to be bigger than one of wanna's
> subindexes it would contain > 320000 keys. How many keys do very large
> indexes have?
As a starting idea, try splitting the <site> part into multiple files:
    site_XXXX.xml
where XXXX is a prefix of MD5( SSK@/CHK@ of the site ).
Take the MD5 of the key only, _NOT THE DOC PATH_.
This would have the following advantages:
 - the files would compress better
 - USK@ editions would be grouped together
   * USK edition-based magic is easier.
   * Words across multiple editions would look similar,
     so grouping means fewer site files to fetch.
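
A minimal sketch of the naming scheme above, in Python. Several things
here are my assumptions, not something settled in this thread: that XXXX
means the first 4 hex digits of the MD5, that "the key" is everything
before the first "/" of the URI, and the function name site_file_name
itself. The example keys in the usage note are made up.

```python
import hashlib


def site_file_name(uri: str, prefix_len: int = 4) -> str:
    """Map a site URI to the name of the <site> file that would hold it.

    Hypothetical helper: only the key part of the URI is hashed, so the
    doc path (and the USK edition, which lives in the path) is ignored.
    """
    # Everything before the first "/" is treated as the key (assumption).
    key_part = uri.split("/", 1)[0]
    digest = hashlib.md5(key_part.encode("utf-8")).hexdigest()
    # prefix_len = 4 hex digits is an assumed reading of "XXXX".
    return "site_%s.xml" % digest[:prefix_len]
```

With this, "USK@abc,def,AQACAAE/mysite/3/index.html" and
"USK@abc,def,AQACAAE/mysite/4/other.html" land in the same site file,
since only "USK@abc,def,AQACAAE" is hashed. Note the sketch does not
rewrite USK@ to SSK@ before hashing; if both forms of the same key can
appear in an index, they would need to be normalized first.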
>
> MikeB