[freenet-dev] Should the spider ignore common words?

Daniel Cheng Thu, 11 Jun 2009 22:26:23 +0800

On Thu, Jun 11, 2009 at 10:15 PM, Mike Bush<mpbush at gmail.com> wrote:
> On Thu, 2009-06-11 at 21:25 +0800, Daniel Cheng wrote:
>> On 11/6/2009 20:16, Mike Bush wrote:
>> > 2009/6/10 Daniel Cheng<j16sdiz+freenet at gmail.com>:
>> [...]
>> >>
>> >> This is yet another reason to split the <site> part out.
>> >
>> > I've built 2 indexes to find the space saving from separating keys
>> > from words as well,
>> >   for an index of > 16000 keys with 256 subindices:
>> >
>> > The normal index, with keys integrated in the files: > 400MB
>> > With keys in a separate key index (3MB), it totals 160MB
>> >
>> > Of course the difference wouldn't be so large if the index wasn't
>> > separated into so many pieces.
>> >
>> > One thing I worried about was that the file index would get very
>> > large, but even for the key index to be bigger than one of wanna's
>> > subindexes it would contain > 320000 keys. How many keys do very large
>> > indexes have?
>>
>> As a starting idea,
>> try splitting the <site> part into multiple files:
>>
>>      site_XXXX.xml
>> where
>>      XXXX is the prefix of MD5( SSK@/CHK@ of the site )
>>
>> Take the MD5 of the key, but _NOT THE DOC PATH_.
>> This would have the following advantages:
>>
>>     - the file would compress better
>>
>>     - USK@ editions would be grouped together
>>         * USK edition based magic is easier.
>>         * Words across multiple editions would look similar;
>>           grouping means fewer site files to fetch
>>
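To make the naming scheme above concrete, here is a minimal sketch. It
assumes the "key" is everything before the first "/" in the URI, and that
XXXX is the first four hex digits of the MD5 digest; neither detail is fixed
by the proposal. The function names and the example URIs are made up for
illustration.

```python
import hashlib


def site_key(uri: str) -> str:
    """Return the key part of a URI, dropping the doc path.

    Assumption: the key is everything before the first '/'.
    """
    return uri.split("/", 1)[0]


def site_file_name(uri: str, prefix_len: int = 4) -> str:
    """Map a site URI to its site_XXXX.xml file name.

    Assumption: XXXX is the first `prefix_len` hex digits of MD5(key).
    """
    digest = hashlib.md5(site_key(uri).encode("utf-8")).hexdigest()
    return f"site_{digest[:prefix_len]}.xml"


# Made-up URIs: two editions of the same USK, with different doc paths.
edition_12 = "USK@abc,def,AQACAAE/mysite/12/index.html"
edition_13 = "USK@abc,def,AQACAAE/mysite/13/page/two.html"

# Both editions land in the same site file because the path is ignored.
print(site_file_name(edition_12) == site_file_name(edition_13))
```

Since only the key is hashed, every edition and every page of one site ends
up in the same file, which is what gives the grouping described above.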
>
> I would imagine that splitting the site index would be futile, though. If
> it was only split into a few files, for example 16, a typical search
> returning many hundreds of results would still require most parts. On
> the other hand, a large number of splits would mean a smaller proportion
> needs to be requested, but the large number of requests would slow it
> further.



Possible Sol'n:
(NOTE TO mikeb: You don't have to implement this in this version;
 this can be another summer :))

   * Load them lazily.
   - Split the results into pages.
   - Include all the stats related to ranking (that would be the term
      positions) in the keyword index file, so we can do TF-IDF style ranking.
   - Prefetch the site files for other pages in the background.
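
A rough sketch of how the points above could fit together, under some
assumptions of mine: the keyword index gives, per document, the list of term
positions; TF is just the number of positions; IDF is log(N / df). All names
here (rank, page_of, LazySiteFiles, the fetch callable) are hypothetical, not
existing code, and background prefetching is left out.

```python
import math


def tf_idf_scores(postings, total_docs):
    """Score documents for one keyword.

    postings: dict mapping doc id -> list of term positions (assumed layout).
    TF is the raw position count, IDF is log(total_docs / document frequency).
    """
    if not postings:
        return {}
    idf = math.log(total_docs / len(postings))
    return {doc: len(positions) * idf for doc, positions in postings.items()}


def rank(postings, total_docs):
    """Return doc ids, best score first; ties are broken by doc id."""
    scores = tf_idf_scores(postings, total_docs)
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


def page_of(ranked, page, page_size=10):
    """Return one page (0-based) of an already ranked result list."""
    start = page * page_size
    return ranked[start:start + page_size]


class LazySiteFiles:
    """Fetch a site file only on first use, then keep it in memory."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical callable: file name -> content
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
        return self._cache[name]


postings = {"a": [1, 5, 9], "b": [2], "c": [3, 4]}
ranked = rank(postings, total_docs=10)
print(page_of(ranked, page=0, page_size=2))
```

Ranking needs only the keyword index, so the site files are touched just for
the documents on the page being displayed.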
