On Thu, Sep 11, 2014 at 8:58 AM, Dan Kennedy <danielk1977 at gmail.com> wrote:
> Fts5 is still in the experimental stage at the moment.
>
> If anybody has any ideas for useful features, or knows of problems with FTS4
> that could be fixed in FTS5, don't keep them to yourself!

Apologies for not noticing this thread earlier!

After fts2 was released, someone engaged me on a discussion about
whether I had considered an alternate storage strategy.  The current
system of {term,doclist} where doclist is something like
[{docid,[pos]}] means that the index b-tree is very lumpy because
doclists are (extremely) variably-sized.  The suggestion was to store
things as an ordered set of {term,doc,pos} tuples, then use some sort
of delta encoding between them.  This would quite naturally balance
the interior of the index versus the leaves, and would also work well
with incremental merging since you only needed to worry about the head
block for each segment being scanned.  I believe the current fts5 code
gets similar results by keeping an index for large doclists to allow
quickly scanning to the right point, so this might not add much.

Something that bugged me a lot was that I had used deletion markers to
cancel out hits, but did not provide a way for deletion markers to
cancel out.  The main problem with this was that a large delete would
stay in the system until it reached the final segment, even if it had
already overtaken all of the original inserts.  I wished that I had
either maintained a separate structure tracking _document_ deletion
(which would make merges somewhat more complicated because they
wouldn't be term-centric), or code updates as "delete+insert".  In the
latter case deletes could drop out at the point where they reached the
original insert.

I seem to recall being upset by the amount of compression gzip could
manage against index blocks, even though they mostly aren't very
large.  I think things got around 1/4 or 1/3 smaller.  To me that
implied that there were probably some gains to be had in encoding.
[This is distinct from compression of content data, which fts3/4
already support.]

I'm 100% convinced that merging could be improved :-).  Clearly there
is a lot of benefit to merging together the low-order segments, but I
never figured out a good way to model whether merging the larger
segments actually improved anything, since at some point you no longer
can really enforce locality anyhow.  But I'm guessing that your
experiments with the sqlite4 key/value store probably involve lots of
exploration along these lines.

-scott

Reply via email to