[freenet-dev] Should the spider ignore common words?

Evan Daniel Wed, 10 Jun 2009 04:06:40 -0400

On Wed, Jun 10, 2009 at 3:49 AM, Daniel Cheng<j16sdiz+freenet at gmail.com> 
wrote:
> On Wed, Jun 10, 2009 at 3:18 PM, Evan Daniel<evanbd at gmail.com> wrote:
>> On Wed, Jun 10, 2009 at 2:56 AM, Daniel Cheng<j16sdiz+freenet at gmail.com> 
>> wrote:
>>> On Wed, Jun 10, 2009 at 2:06 PM, Evan Daniel<evanbd at gmail.com> wrote:
>>>> On Wed, Jun 10, 2009 at 1:54 AM, Daniel Cheng<j16sdiz+freenet at 
>>>> gmail.com> wrote:
>>>>> On Wed, Jun 10, 2009 at 12:02 PM, Evan Daniel<evanbd at gmail.com> wrote:
>>>>>> On my (incomplete) spider index, the index file for the word "the" (it
>>>>>> indexes no other words) is 17MB.  This seems rather large.  It might
>>>>>> make sense to have the spider not even bother creating an index on a
>>>>>> handful of very common words (the, be, to, of, and, a, in, I, etc).
>>>>>> Of course, this presents the occasional difficulty:
>>>>>> http://bash.org/?514353  I think I'm in favor of not indexing common
>>>>>> words even so.
>>>>>
>>>>> Yes, it should ignore common words.
>>>>> This is called a "stopword" in search engine terminology.
>>>>>
>>>>>>
>>>>>> Also, on a related note, the index splitting policy should be a bit
>>>>>> more sophisticated: in an attempt to fit within the max index size as
>>>>>> configured, it split all the way down to index_8fc42.xml.  As a
>>>>>> result, the file index_8fc4b.xml sits all by itself at 3KiB.  It
>>>>>> contains the two words "vergessene" and "txjmnsm".  I suspect it would
>>>>>> have reliability issues should anyone actually want to search either
>>>>>> of those.  It would make more sense to have all of index_8fc4 in one
>>>>>> file, since it would be only trivially larger.  (I have a patch that I
>>>>>> thought did that, but it has a bug; I'll test once my indexwriter is
>>>>>> finished writing, since I don't want to interrupt it by reloading the
>>>>>> plugin.)
>>>>>
>>>>> "trivially larger" ...
>>>>> ugh... how trivial is trivial?
>>>>>
>>>>> XMLLibrarian can handle index_8fc42.xml on its own, with all other
>>>>> 8fc4 entries in index_8fc4.xml.
>>>>> However, as I have stated on IRC, that makes index generation even slower.
>>>>
>>>> 8fc42 is 17382 KiB.  All other 8fc4 are 79 KiB combined.
>>>>
>>>> Also, it would make index generation faster.  The spider first does
>>>> all the work of creating 8fc4, then discards it to recreate the
>>>> sub-indexes.  The vast majority of this work is in 8fc42, which gets
>>>> created twice.  Not splitting the index would nearly halve the time to
>>>
>>> It doesn't get created twice; it shortcuts early.
>>> See the estimateSize variable in IndexWriter.
>>
>> Unless I'm mistaken, the slow part of the index creation is the
>> term.getPages() call.  That call is where all the disk io hides, no?
>
> no :)
> getPages() returns an IPersistentSet (ScalableSet), which is lazily evaluated.
>
> Internally, it is a linked set when small and a btree when large.
> The .size() method is always cached.


In this case, I don't think it helps.  13 bytes is a gross
underestimate of how much each added page contributes to the file size,
and estimateSize isn't checked again until all the pages have been added.
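
To make the concern concrete, here is a rough sketch (in Python rather
than the plugin's Java, purely for brevity) of checking a running size
after every page instead of relying on a flat per-page guess.  The
function names and the XML-like entry format are hypothetical and are
not taken from IndexWriter; only the 13-byte figure comes from the
discussion above.

```python
BYTES_PER_PAGE_GUESS = 13  # the per-page figure discussed above


def naive_estimate(page_count):
    """Flat estimate: assumes every page adds the same small amount."""
    return page_count * BYTES_PER_PAGE_GUESS


def add_pages_checked(entries, max_bytes):
    """Add entries one by one, checking the running size after each.

    `entries` is a list of already-serialized strings (hypothetical
    format).  Returns (pages_added, bytes_used, overflowed).
    """
    used = 0
    for count, entry in enumerate(entries):
        size = len(entry.encode("utf-8"))
        if used + size > max_bytes:
            # Stop as soon as the budget would be exceeded, rather than
            # finding out after every page has been written.
            return count, used, True
        used += size
    return len(entries), used, False


# Hypothetical entries; a real index entry is likely to be longer still.
entries = ['<file id="%d" title="x"/>' % i for i in range(100)]
print(naive_estimate(len(entries)))
print(add_pages_checked(entries, max_bytes=naive_estimate(len(entries))))
```

Even with these deliberately short made-up entries, the flat estimate
runs out well before all 100 pages fit, which is the kind of gap I mean.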

Furthermore, that leaves the timing unexplained.  It takes as long to
generate b70 as all the rest of b7* combined.  This is fairly
consistent across the whole set of files (obviously some variation is
present).

2009-06-10 02:59 index_b6e.xml
2009-06-10 03:00 index_b6f.xml
2009-06-10 03:16 index_b70.xml
2009-06-10 03:17 index_b71.xml
2009-06-10 03:18 index_b72.xml
2009-06-10 03:19 index_b73.xml
2009-06-10 03:20 index_b74.xml
2009-06-10 03:21 index_b75.xml
2009-06-10 03:21 index_b76.xml
2009-06-10 03:22 index_b77.xml
2009-06-10 03:24 index_b78.xml
2009-06-10 03:24 index_b79.xml
2009-06-10 03:25 index_b7a.xml
2009-06-10 03:27 index_b7b.xml
2009-06-10 03:28 index_b7c.xml
2009-06-10 03:28 index_b7d.xml
2009-06-10 03:29 index_b7e.xml
2009-06-10 03:30 index_b7f.xml
2009-06-10 03:45 index_b80.xml
2009-06-10 03:47 index_b81.xml
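
For anyone who wants to check the arithmetic, a small Python sketch
that turns the listing above into per-file generation times.  It
assumes the files are written one after another and that each
timestamp marks when that file was finished, so the gap to the
previous timestamp approximates the time spent on that file.

```python
from datetime import datetime

LISTING = """\
2009-06-10 02:59 index_b6e.xml
2009-06-10 03:00 index_b6f.xml
2009-06-10 03:16 index_b70.xml
2009-06-10 03:17 index_b71.xml
2009-06-10 03:18 index_b72.xml
2009-06-10 03:19 index_b73.xml
2009-06-10 03:20 index_b74.xml
2009-06-10 03:21 index_b75.xml
2009-06-10 03:21 index_b76.xml
2009-06-10 03:22 index_b77.xml
2009-06-10 03:24 index_b78.xml
2009-06-10 03:24 index_b79.xml
2009-06-10 03:25 index_b7a.xml
2009-06-10 03:27 index_b7b.xml
2009-06-10 03:28 index_b7c.xml
2009-06-10 03:28 index_b7d.xml
2009-06-10 03:29 index_b7e.xml
2009-06-10 03:30 index_b7f.xml
2009-06-10 03:45 index_b80.xml
2009-06-10 03:47 index_b81.xml
"""


def generation_minutes(listing):
    """Map each file (except the first) to minutes since the previous file."""
    rows = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        date, time, name = line.split()
        stamp = datetime.strptime(date + " " + time, "%Y-%m-%d %H:%M")
        rows.append((stamp, name))
    return {
        name: int((stamp - prev_stamp).total_seconds() // 60)
        for (prev_stamp, _), (stamp, name) in zip(rows, rows[1:])
    }


minutes = generation_minutes(LISTING)
b70 = minutes["index_b70.xml"]
rest_of_b7 = sum(
    m for name, m in minutes.items()
    if name.startswith("index_b7") and name != "index_b70.xml"
)
print(b70, rest_of_b7)  # prints "16 14"
```

The timestamps only have minute resolution, so individual small files
are noisy, but the x0 files stand out clearly.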

Evan Daniel
