On Thu, Sep 11, 2014 at 8:58 AM, Dan Kennedy <danielk1977 at gmail.com> wrote: > Fts5 is still in the experimental stage at the moment. > > If anybody has any ideas for useful features, or knows of problems with FTS4 > that could be fixed in FTS5, don't keep them to yourself!
Apologies for not noticing this thread earlier! After fts2 was released, someone engaged me on a discussion about whether I had considered an alternate storage strategy. The current system of {term,doclist} where doclist is something like [{docid,[pos]}] means that the index b-tree is very lumpy because doclists are (extremely) variably-sized. The suggestion was to store things as an ordered set of {term,doc,pos} tuples, then use some sort of delta encoding between them. This would quite naturally balance the interior of the index versus the leaves, and would also work well with incremental merging since you only needed to worry about the head block for each segment being scanned. I believe the current fts5 code gets similar results by keeping an index for large doclists to allow quickly scanning to the right point, so this might not add much. Something that bugged me a lot was that I had used deletion markers to cancel out hits, but did not provide a way for deletion markers to cancel out. The main problem with this was that a large delete would stay in the system until it reached the final segment, even if it had already overtaken all of the original inserts. I wished that I had either maintained a separate structure tracking _document_ deletion (which would make merges somewhat more complicated because they wouldn't be term-centric), or code updates as "delete+insert". In the latter case deletes could drop out at the point where they reached the original insert. I seem to recall being upset by the amount of compression gzip could manage against index blocks, even though they mostly aren't very large. I think things got around 1/4 or 1/3 smaller. To me that implied that there were probably some gains to be had in encoding. [This is distinct from compression of content data, which fts3/4 already support.] I'm 100% convinced that merging could be improved :-). Clearly there is a lot of benefit to merging together the low-order segments, but I never figured out a good way to model whether merging the larger segments actually improved anything, since at some point you no longer can really enforce locality anyhow. But I'm guessing that your experiments with the sqlite4 key/value store probably involve lots of exploration along these lines. -scott