So if words change in a page which is being reindexed, shouldn't the positions of words in the word->url table be changed as well? If I read method 2 correctly, this implies this isn't done....
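
To make the question concrete, here is a minimal Python sketch of how I read method 2. It is only an illustration, not ASPseek's actual code: the function names `positions` and `delta` and the whitespace tokenizer are my own assumptions.

```python
from collections import defaultdict


def positions(text):
    """Map each word to the list of positions where it occurs (hypothetical helper)."""
    index = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        index[word].append(pos)
    return dict(index)


def delta(old_text, new_text):
    """Compare the cached copy with the new page (hypothetical helper).

    Returns three sorted lists: words that disappeared, words that are new,
    and words present in both versions whose positions differ.
    """
    old, new = positions(old_text), positions(new_text)
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    moved = sorted(w for w in old.keys() & new.keys() if old[w] != new[w])
    return removed, added, moved


# Cached copy: "memory penny"; new version of the page: "penny candy"
print(delta("memory penny", "penny candy"))
# prints (['memory'], ['candy'], ['penny'])
```

Here "penny" stays in the page, so a delta built only from added and removed words would leave its wordurl record untouched, even though its position changed from 1 to 0. That third list is the case I am asking about.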
2-7-02 16:29:16, Kir Kolyshkin <[EMAIL PROTECTED]> wrote:
>yayaivan wrote:
>>
>> Hi,
>>
>> I don't use citation. Because it takes a lot of disk space, I deleted everything
>> from the citation table and set "IncrementalCitations no" in aspseek.conf and
>> searchd.conf
>
>I wonder who told you that you can do so?
>
>> But now the indexer is running in a strange manner. After finishing indexing
>> sites, I notice that some processes still work, and they are inserting
>> data in the citation table:(
>> I looked in the conf files, and noticed only now this: "You MUST NOT
>> change value of this parameter for not empty database". But I already did it:(
>> How can I now correctly stop the indexer from working with citations?
>
>No way. A cached copy of each file is needed for correct reindexing
>of the page. Let's assume that you have a page with two words in
>it: "memory" and "penny". Upon the first indexing, its compressed
>cached copy is saved in the database, and a URL_ID is assigned
>to the page (let's assume it is 101).
>
>Next, words are saved into the inverted index: word -> urls. So we have two
>records in the wordurl table:
>
>....
>memory -> 101
>....
>penny -> 101
>
>(Actually the word position and some other info is saved together with
>the URL_ID, but I will skip it here for clarity.)
>
>Now note that the words "memory" and "penny" can appear not only
>in this page, but in many other pages as well. And there are
>countless words. So we actually end up with a very
>big table.
>
>During the next reindexing, if the document has changed, we need to
>clear the words that are no longer in the document, and add the new words.
>This can be done in two ways:
>
>1. Remove URL_ID 101 from all tables, and add all words.
>   This is very inefficient because finding all occurrences of 101
>   in all wordurls can take several minutes.
>
>2. Find out what words have disappeared from the page and are
>   to be deleted, and what new words are found in the page and
>   are to be inserted.
>
>Method number 2 is more practical, but we need to know what words
>were in the document the previous time it was indexed. Again,
>scanning all wordurl records takes way too long.
>
>That's why aspseek saves a copy of the indexed page, and uses
>it upon reindexing to create a "delta" (changes) between
>the two versions of the page. If you have deleted this copy,
>index is just not able to work any more.
>
>And last, but not least. The option "IncrementalCitations" does not
>control whether a cached copy of the document is saved. It just turns on
>a special enhanced mode of reindexing which is faster and requires
>less memory, but is not compatible with the aspseek-1.0 format.
>So it is here just for backward compatibility, and probably
>will be removed in aspseek-1.3.
>
>--
> [EMAIL PROTECTED] ICQ UIN 7551596 Phone +7 903 6722750
>--
> Guinness a Day Keeps a Doctor Away (people's wisdom)
>
