On 11/6/2009 20:16, Mike Bush wrote:
> 2009/6/10 Daniel Cheng<j16sdiz+freenet at gmail.com>:
[...]
>>
>> This is yet another reason to split the <site> part out.
>
> I've built 2 indexes to find the space saving from separating keys
> from words as well,
>   for an index of > 16000 keys with 256 subindices:
>
> The normal index with keys integrated in files: > 400MB
> With keys in a separate key index (3MB) it totals 160MB
>
> Of course the difference wouldn't be so large if the index wasn't
> separated into so many pieces.
>
> One thing I worried about was that the file index would get very
> large, but even for the key index to be bigger than one of wanna's
> subindexes it would contain > 320000 keys. How many keys do very large
> indexes have?

As a starting idea,
try splitting the <site> part into multiple files:

     site_XXXX.xml
where
     XXXX is the prefix of MD5( SSK@/CHK@ of the site )

Take the MD5 of the key only, _NOT THE DOC PATH_.
This would have the following advantages:

    - the file would compress better

    - USK@ editions would be grouped together
        * USK edition-based magic is easier.
        * Words across multiple editions would look similar,
          so grouping means fewer site files to fetch


>
> MikeB
