Sander Bokhorst wrote:
>
> So if words change in a page which is being reindexed, shouldn't the
> positions of words in the word->url table be changed as well?
> If I read method 2 correctly, this implies this isn't done....

This was, well, a very simple description without getting into all the
gory details. Yes, surely the position is changed. Actually this is a
list of positions for each URL.

>
> 2-7-02 16:29:16, Kir Kolyshkin <[EMAIL PROTECTED]> wrote:
>
> >yayaivan wrote:
> >>
> >> Hi,
> >>
> >> I don't use citation. Because it takes a lot of disk space, I delete everything
> >> from citation table and set "IncrementalCitations no" in aspseek.conf and
> >> searchd.conf
> >
> >I wonder who told you that you can do so?
> >
> >> But now, indexer is running in strange manner. After finishing indexing
> >> sites, I notice that some processes still work, and they are inserting
> >> data in citation table:(
> >> I look in conf files, and notice only now this : "You MUST NOT
> >> change value of this parameter for not empty database". but I already did it:(
> >> How can I now correctly stop indexer work with citations?
> >
> >No way. Cached copy of each file is needed for correct reindexing
> >of the page. Let's assume that you have a page with two words in
> >it: "memory" and "penny". Upon the first indexing, its compressed
> >cached copy is saved in the database, and a URL_ID is assigned
> >to the page (let's assume it is 101).
> >
> >Next, words are saved into inverted index: word -> urls. So, we have two
> >records in wordurl table:
> >
> >....
> >memory -> 101
> >....
> >penny -> 101
> >
> >(Actually the word position and some other info is saved together with
> >URL_ID, but I will skip it here for clarity).
> >
> >Now note that the words "memory" and "penny" can appear not only
> >in this page, but on many other pages as well. And there are
> >a countless number of words. So actually we do end up with a very
> >big table.
> >
> >During the next reindexing, if the document is changed, we need to
> >clear the words that are no longer in the document, and add new words.
> >This can be done in two ways:
> >
> >1. Remove URL_ID 101 from all tables, and add all words.
> >   This is very inefficient because finding all occurrences of 101
> >   in all wordurls can take several minutes
> >
> >2. Find out what words have disappeared from the page and are
> >   to be deleted, and what new words are found in the page and
> >   are to be inserted.
> >
> >Method number 2 is more practical, but we need to know what words
> >were in the document when it was indexed the previous time. Again,
> >scanning all wordurl records is way too long.
> >
> >That's why aspseek saves a copy of the page indexed, and uses
> >it upon reindexing to create a "delta" (changes) between
> >two versions of the page. If you have deleted this copy,
> >index is just not able to work any more.
> >
> >And last, but not least. Option "IncrementalCitations" does not
> >switch saving a cached copy of the document. It just turns on
> >a special enhanced mode of reindexing which is faster and requires
> >less memory, but is not compatible with aspseek-1.0 format.
> >So it is here just for backward compatibility, and probably
> >will be removed in aspseek-1.3.
> >
> >-- [EMAIL PROTECTED] ICQ UIN 7551596 Phone +7 903 6722750 --
> > Guinness a Day Keeps a Doctor Away (people's wisdom)
>

-- [EMAIL PROTECTED] ICQ UIN 7551596 Phone +7 903 6722750 --
 Guinness a Day Keeps a Doctor Away (people's wisdom)
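
P.S. For those who prefer code to words, method 2 can be sketched roughly as below. This is only a toy model in Python and not aspseek's actual code: the names `positions`, `reindex`, `wordurl` and `cache`, and the dictionary layout, are all made up for illustration. It keeps a list of positions per URL for each word, and uses the saved copy of the page to work out which records to drop and which to rewrite.

```python
from collections import defaultdict


def positions(text):
    """Map each word of a page to the list of its positions."""
    result = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        result[word].append(pos)
    return dict(result)


def reindex(wordurl, cache, url_id, new_text):
    """Update a toy inverted index for one page, using its cached copy.

    wordurl: word -> {url_id: [positions]}   (hypothetical layout)
    cache:   url_id -> page text saved at the previous indexing
    """
    old = positions(cache.get(url_id, ""))
    new = positions(new_text)

    # Words that disappeared from the page: drop this URL from their records.
    for word in old.keys() - new.keys():
        del wordurl[word][url_id]
        if not wordurl[word]:
            del wordurl[word]

    # New words, and words whose position list changed: (re)write the record.
    for word, pos in new.items():
        if old.get(word) != pos:
            wordurl.setdefault(word, {})[url_id] = pos

    # Keep the new copy so the next reindexing can compute its own delta.
    cache[url_id] = new_text


wordurl, cache = {}, {}
reindex(wordurl, cache, 101, "memory penny")
reindex(wordurl, cache, 102, "penny")
reindex(wordurl, cache, 101, "penny candy")
print(wordurl)
```

Note that only the records for words on page 101 are touched; nothing scans the whole table. In this sketch, if `cache[101]` were thrown away before the last call, `old` would come out empty and the stale "memory -> 101" record would never be removed, which is the same kind of trouble as deleting the saved copies.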
