On Fri, Nov 11, 2011 at 9:58 AM, nobre <rafael.ro...@novaprolink.com.br> wrote:
> Hi, I'm studying the indexing mechanism of FTS3/4, I can pretty much
> understand how doclists, terms, segments are created and stored, but one
> thing I can't grasp is about updating and deleting docs and keeping up the
> index up to date. From the source comments:
>
> [quote]
> ** Since we're using a segmented structure, with no docid-oriented
> ** index into the term index, we clearly cannot simply update the term
> ** index when a document is deleted or updated.  For deletions, we
> ** write an empty doclist (varint(docid) varint(POS_END)), for updates
> ** we simply write the new doclist.  Segment merges overwrite older
> ** data for a particular docid with newer data, so deletes or updates
> ** will eventually overtake the earlier data and knock it out.  The
> ** query logic likewise merges doclists so that newer data knocks out
> ** older data.
> [/quote]
>
> Its clear to me that with the way things are stored, it would be crazy to
> update all doclists with matches related to a single docid.
> I just don't see how a segment merge can possibly know which doclist is
> older/newer, other than by the level. What happens when a document stored in
> a level 0 segment (recently inserted) is updated or deleted? Which one will
> be kept and go up to a level 1 segment?

The lower-level segments are always newer, and the lower-numbered
segments within a level are always older.  So if you have a
transaction which adds a document (causing a level-0 segment), then
another which deletes that document (another level-0 segment), when
the level-0 segments are merged into a level-1 segment the references
to the original insert will be overwritten by the references to the
delete.  If a query happens before that merge, all of the doclists for
the query term will merge together and the same thing will happen.

Unfortunately, due to how things work those deletes have to stick
around, even after they catch to originals.  The problem is how
updates are implemented.  If you did an insert, an update, and a
delete on a document, then the update can catch the insert and
overwrite it, then the delete can catch the update, and all is fine.
But if the delete catches the update first, the delete has to be
retained because there could be an even earlier record in the system.
One could conceive of a very similar system where deletes destruct
with the original inserts, but it would require slight
(format-incompatible) tweaks to the storage format.  Or one could
track documents in the system more explicitly and purge deletions that
way.  [Apologies if my fuzzy memory made this paragraph a fuzzy
explanation.]

-scott
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to