On 04/08/2015 04:49 AM, Scott Hess wrote: > On Thu, Sep 11, 2014 at 8:58 AM, Dan Kennedy <danielk1977 at gmail.com> wrote: >> Fts5 is still in the experimental stage at the moment. >> >> If anybody has any ideas for useful features, or knows of problems with FTS4 >> that could be fixed in FTS5, don't keep them to yourself! > Apologies for not noticing this thread earlier! > > After fts2 was released, someone engaged me on a discussion about > whether I had considered an alternate storage strategy. The current > system of {term,doclist} where doclist is something like > [{docid,[pos]}] means that the index b-tree is very lumpy because > doclists are (extremely) variably-sized. The suggestion was to store > things as an ordered set of {term,doc,pos} tuples, then use some sort > of delta encoding between them. This would quite naturally balance > the interior of the index versus the leaves, and would also work well > with incremental merging since you only needed to worry about the head > block for each segment being scanned. I believe the current fts5 code > gets similar results by keeping an index for large doclists to allow > quickly scanning to the right point, so this might not add much. > > Something that bugged me a lot was that I had used deletion markers to > cancel out hits, but did not provide a way for deletion markers to > cancel out. The main problem with this was that a large delete would > stay in the system until it reached the final segment, even if it had > already overtaken all of the original inserts. I wished that I had > either maintained a separate structure tracking _document_ deletion > (which would make merges somewhat more complicated because they > wouldn't be term-centric), or code updates as "delete+insert". In the > latter case deletes could drop out at the point where they reached the > original insert.
Thanks for this. The "delete+insert" idea sounds like quite an interesting one. So instead of just "delete" and "insert" keys, the merge tree now also contains "delete+insert" keys (call them "update" keys). Then maintain the tree so that (a) for each "insert", the next youngest duplicate key must either not exist or be a "delete", (b) for each "update", the next youngest duplicate key must exist and must be an "insert" or "update", and (c) for each "delete", the next youngest duplicate key must exist and must be an "insert" or "update". And as a result, when a "delete" catches up with an "insert" while merging they can both be discarded. Instead of the current situation, where we retain the "delete" unless the output segment is the oldest in the database. Cool. I guess they don't generally do this in merge-trees because the cost of figuring out whether to use "update" or "insert" keys when writing a new segments is prohibitively high. But FTS doesn't have that problem, as it never does a true "blind write". When it clobbers a key it always knows it at time of writing. Dan. > > I seem to recall being upset by the amount of compression gzip could > manage against index blocks, even though they mostly aren't very > large. I think things got around 1/4 or 1/3 smaller. To me that > implied that there were probably some gains to be had in encoding. > [This is distinct from compression of content data, which fts3/4 > already support.] > > I'm 100% convinced that merging could be improved :-). Clearly there > is a lot of benefit to merging together the low-order segments, but I > never figured out a good way to model whether merging the larger > segments actually improved anything, since at some point you no longer > can really enforce locality anyhow. But I'm guessing that your > experiments with the sqlite4 key/value store probably involve lots of > exploration along these lines. > > -scott > _______________________________________________ > sqlite-users mailing list > sqlite-users at mailinglists.sqlite.org > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users