On Thu, 2009-06-11 at 21:25 +0800, Daniel Cheng wrote:
> On 11/6/2009 20:16, Mike Bush wrote:
> > 2009/6/10 Daniel Cheng <j16sdiz+freenet at gmail.com>:
> [...]
> >>
> >> This is yet another reason to split the <site> part out.
> >
> > I've built 2 indexes to find the space saving from separating keys
> > from words as well,
> > for an index > 16000 keys with 256 subindices:
> >
> > The normal index with keys integrated in files > 400MB
> > With keys in a separate key index (3MB) it totals 160MB
> >
> > Of course the difference wouldn't be so large if the index wasn't
> > separated into so many pieces.
> >
> > One thing I worried about was that the file index would get very
> > large, but even for the key index to be bigger than one of wanna's
> > subindexes it would contain > 320000 keys. How many keys do very large
> > indexes have?
>
> For a starter idea,
> try to split the <site> into multiple files..
>
> site_XXXX.xml
> where
> XXXX is the prefix of MD5( SSK@/CHK@ of the site )
>
> take the MD5 of the key, but _NOT THE DOC PATH_.
> This would have the following advantages:
>
> - the file would compress better
>
> - USK@ editions would be grouped together
>   * USK edition based magic is easier.
>   * Words across multiple editions would look similar,
>     grouping means fewer site files to fetch
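To make sure I follow the proposal, here is roughly what the bucket selection would look like. This is just a sketch: the four-hex-digit prefix length, the class/method names and the assumption that the doc path has already been stripped from the key are mine, not part of your suggestion.

    import java.nio.charset.Charset;
    import java.security.MessageDigest;

    public class SiteBucket {
        // Sketch: pick the subindex file a site key would land in.
        // Assumes XXXX = first four hex digits of MD5, and that siteKey
        // is the bare SSK@/CHK@ with the document path already removed.
        static String subindexFileFor(String siteKey) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] d = md5.digest(siteKey.getBytes(Charset.forName("UTF-8")));
            String prefix = String.format("%02x%02x", d[0] & 0xff, d[1] & 0xff);
            return "site_" + prefix + ".xml";
        }
    }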
I would imagine that splitting the site index could be futile, though. If it were split into only a few files, say 16, a typical search result of many hundreds of hits would still require fetching most of the parts. On the other hand, a large number of splits would mean only a small proportion of the files need to be requested, but the sheer number of individual requests would slow things down further.
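A back-of-the-envelope way to see that trade-off (assuming MD5 spreads the result keys evenly over the subindex files; the method name is mine):

    // Expected number of distinct subindex files touched by R result keys
    // spread evenly over S files: S * (1 - (1 - 1/S)^R)
    static double expectedFilesTouched(int splits, int results) {
        return splits * (1.0 - Math.pow(1.0 - 1.0 / splits, results));
    }

For 300 results that works out to essentially all 16 of 16 files, versus roughly 177 of 256 files, so more splits do cut the proportion fetched, just at the cost of far more individual requests.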
