So if words change in a page which is being reindexed, shouldn't the
positions of words in the word->url table be changed as well?
If I read method 2 correctly, it implies that this isn't done...


2-7-02 16:29:16, Kir Kolyshkin <[EMAIL PROTECTED]> wrote:

>yayaivan wrote:
>> 
>> Hi,
>> 
>> I don't use citation. Because it takes a lot of disk space, I deleted everything
>> from the citation table and set "IncrementalCitations no" in aspseek.conf and
>> searchd.conf
>
>I wonder who told you that you can do so?
>
>> But now the indexer is running in a strange manner. After it finishes indexing
>> sites, I notice that some processes still work, and they are inserting
>> data into the citation table :(
>> I looked in the conf files and noticed only now this: "You MUST NOT
>> change value of this parameter for not empty database". But I already did it :(
>> How can I now correctly stop the indexer from working with citations?
>
>No way. A cached copy of each file is needed for correct reindexing
>of the page. Let's assume that you have a page with two words in
>it: "memory" and "penny". Upon the first indexing, its compressed
>cached copy is saved in the database, and a URL_ID is assigned
>to the page (let's assume it is 101).
>
>Next, words are saved into the inverted index: word -> urls. So, we have
>two records in the wordurl table:
>
>....
>memory -> 101
>....
>penny -> 101
>
>(Actually the word position and some other info is saved together with
>URL_ID, but I will skip it here for clarity).
>
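The word -> urls layout described above can be pictured with a small
in-memory sketch. This is only an illustration, not aspseek's actual
storage format: the names wordurl and index_page are hypothetical, the
second page (URL_ID 102) is invented for the example, and word positions
are left out just as in the quoted explanation.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the wordurl table: word -> set of URL_IDs.
wordurl = defaultdict(set)

def index_page(url_id, text):
    """Record every distinct word of the page under its URL_ID."""
    for word in text.lower().split():
        wordurl[word].add(url_id)

# The page from the example (URL_ID 101) plus one invented page (102).
index_page(101, "memory penny")
index_page(102, "penny lane")

print(sorted(wordurl["memory"]))  # [101]
print(sorted(wordurl["penny"]))   # [101, 102]
```

Each word maps to every page that contains it, which is why the table
grows so quickly once many pages are indexed.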
>Now note that the words "memory" and "penny" can appear not only
>in this page, but on many other pages as well. And there is
>a countless number of words. So we actually end up with a very
>big table.
>
>During the next reindexing, if the document has changed, we need to
>clear the words that are no longer in the document, and add new words.
>This can be done in two ways:
>
>1. Remove URL_ID 101 from all tables, and add all words.
>   This is very inefficient because finding all occurrences of 101
>   in all wordurls can take several minutes.
>
>2. Find out what words have disappeared from the page and are
>   to be deleted, and what new words are found in the page and
>   are to be inserted.
>
>Method number 2 is more practical, but we need to know what words
>were in the document when it was indexed the previous time. Again,
>scanning all wordurl records takes way too long.
>
>That's why aspseek saves a copy of the indexed page, and uses
>it upon reindexing to create a "delta" (changes) between
>the two versions of the page. If you have deleted this copy,
>the indexer is just not able to work any more.
>
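The "delta" idea of method 2 can be sketched as a set difference between
the cached copy and the freshly fetched page. This is a simplified
illustration under stated assumptions, not aspseek's code: word_delta and
reindex are hypothetical names, the wordurl table is a plain dict, the
replacement word "dime" is invented, and word positions are ignored, so
the sketch says nothing about how position changes are handled.

```python
def word_delta(old_text, new_text):
    """Return (removed, added) word sets between two versions of a page."""
    old_words = set(old_text.lower().split())
    new_words = set(new_text.lower().split())
    return old_words - new_words, new_words - old_words

def reindex(wordurl, url_id, old_text, new_text):
    """Touch only the wordurl entries for words that disappeared or appeared."""
    removed, added = word_delta(old_text, new_text)
    for word in removed:
        wordurl[word].discard(url_id)
    for word in added:
        wordurl.setdefault(word, set()).add(url_id)
    return removed, added

# Hypothetical table state: page 101 holds "memory penny", page 102 also has "penny".
wordurl = {"memory": {101}, "penny": {101, 102}}

# The cached copy supplies the old text; the new text replaces "penny" with "dime".
removed, added = reindex(wordurl, 101, "memory penny", "memory dime")
print(removed, added)  # {'penny'} {'dime'}
print(wordurl)
```

Only the entries for "penny" and "dime" are modified; "memory" is never
touched. Without the old text there is nothing to diff against, which is
the point made in the quoted explanation.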
>And last, but not least. The option "IncrementalCitations" does not
>switch off saving a cached copy of the document. It just turns on
>a special enhanced mode of reindexing which is faster and requires
>less memory, but is not compatible with the aspseek-1.0 format.
>So it is here just for backward compatibility, and probably
>will be removed in aspseek-1.3.
>
>-- [EMAIL PROTECTED] ICQ UIN 7551596 Phone +7 903 6722750 --
>   Guinness a Day Keeps a Doctor Away (people's wisdom)
>


