On Fri, Nov 11, 2011 at 9:58 AM, nobre <rafael.ro...@novaprolink.com.br> wrote: > Hi, I'm studying the indexing mechanism of FTS3/4, I can pretty much > understand how doclists, terms, segments are created and stored, but one > thing I can't grasp is about updating and deleting docs and keeping up the > index up to date. From the source comments: > > [quote] > ** Since we're using a segmented structure, with no docid-oriented > ** index into the term index, we clearly cannot simply update the term > ** index when a document is deleted or updated. For deletions, we > ** write an empty doclist (varint(docid) varint(POS_END)), for updates > ** we simply write the new doclist. Segment merges overwrite older > ** data for a particular docid with newer data, so deletes or updates > ** will eventually overtake the earlier data and knock it out. The > ** query logic likewise merges doclists so that newer data knocks out > ** older data. > [/quote] > > Its clear to me that with the way things are stored, it would be crazy to > update all doclists with matches related to a single docid. > I just don't see how a segment merge can possibly know which doclist is > older/newer, other than by the level. What happens when a document stored in > a level 0 segment (recently inserted) is updated or deleted? Which one will > be kept and go up to a level 1 segment?
The lower-level segments are always newer, and the lower-numbered segments within a level are always older. So if you have a transaction which adds a document (causing a level-0 segment), then another which deletes that document (another level-0 segment), when the level-0 segments are merged into a level-1 segment the references to the original insert will be overwritten by the references to the delete. If a query happens before that merge, all of the doclists for the query term will merge together and the same thing will happen. Unfortunately, due to how things work those deletes have to stick around, even after they catch to originals. The problem is how updates are implemented. If you did an insert, an update, and a delete on a document, then the update can catch the insert and overwrite it, then the delete can catch the update, and all is fine. But if the delete catches the update first, the delete has to be retained because there could be an even earlier record in the system. One could conceive of a very similar system where deletes destruct with the original inserts, but it would require slight (format-incompatible) tweaks to the storage format. Or one could track documents in the system more explicitly and purge deletions that way. [Apologies if my fuzzy memory made this paragraph a fuzzy explanation.] -scott _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users