> Doing an fts index which can handle subset scans efficiently is going to be hard.
I noticed. After some thought, here's what I've come up with:

We'll call nT the number of terms and nD the number of docids in a given term. nTD is the number of rows in a natural join of terms and docids.

The current runtime to look up a term for a match is O(log nT) (right?). The current best possible runtime to look up a docid within a term is O(log nT) + O(nD). This is fine if nD is small, but rapidly becomes a major problem as nD gets larger.

If we doubled the size of the index, adding an extra tree to index terms and docids together, we could get O(log nTD) for a lookup. This is the ideal runtime (assuming we don't redesign the world and use a hash), but it requires extra code and a lot of extra space.

On the other hand, we could add a tree inside each segment to index the doclist. The term would be looked up as normal at a cost of O(log nT). After that, though, if the docid is known, it could be looked up at an additional cost of only O(log nD). The total cost of O(log nT) + O(log nD) is only marginally worse than O(log nTD) (and only because nTD is the count of a natural join, rather than a true product of the counts). The result is still pretty expensive for individual rows, but it is a whole lot better than it is now, and it avoids full scans.

This still doesn't offer direct access to the doclists by docid (you still need terms), but that problem should be easy to solve once the term + docid case is handled separately, because only the docid needs to be indexed at that point. I think the right way to do this is to have the doclist point back to the term it belongs to. Then a list of doclists could be stored with the regular data for each row (it is known at that point, so it requires no extra calculation).

These changes still require a data format change, but worst case that means incrementing the version. Does anyone see a reason why this wouldn't work?
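To make the proposal concrete, here's a minimal sketch of the two-level lookup being described. This is not FTS3's actual on-disk format; the sorted lists, `terms`/`doclists` names, and the `contains` helper are all invented for illustration, with binary search standing in for the b-tree descents:

```python
# Hypothetical sketch of the proposed lookup: a term index plus a
# per-term docid index, giving O(log nT) + O(log nD) per probe.
import bisect

# A sorted term list stands in for the term b-tree; each term maps to
# a sorted doclist, which the proposed per-segment tree would index.
terms = ["apple", "banana", "cherry"]
doclists = {
    "apple":  [2, 5, 9, 14],
    "banana": [1, 5, 7],
    "cherry": [3, 14],
}

def contains(term, docid):
    """Does `docid` appear in `term`'s doclist?"""
    i = bisect.bisect_left(terms, term)        # O(log nT) term lookup
    if i == len(terms) or terms[i] != term:
        return False
    dl = doclists[term]
    j = bisect.bisect_left(dl, docid)          # O(log nD) docid lookup
    return j < len(dl) and dl[j] == docid

print(contains("apple", 5))   # True
print(contains("apple", 6))   # False
```

Without the second bisect you'd be back to the current O(log nT) + O(nD) behavior of walking the whole doclist.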
John

-----Original Message-----
From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Scott Hess
Sent: Monday, October 19, 2009 12:51 PM
To: General Discussion of SQLite Database
Cc: punk...@eidesis.org
Subject: Re: [sqlite] Why FTS3 has the limitations it does

On Sat, Oct 17, 2009 at 1:25 PM, John Crenshaw <johncrens...@priacta.com> wrote:
> Agreed, HUGE thanks for FTS. Hopefully my original post didn't
> come off ungrateful. I was just confused by limitations that
> looked like they could have been removed during the initial
> design (at least more easily than they can now.) Scott's reply
> helps me understand this better, and perhaps gives some
> starting points for finding a solution.

One of the things I found challenging about fts development was that being embedded within SQLite made some problems harder. You can't just spin up a book-keeping thread to do stuff in the background, and you can't easily expose a grungy API to let the client do it, either. Plus you have the issues of shipping a framework (such as not being able to arbitrarily change the file format on a whim, even if it's WRONG). This meant that in many cases I was a bit aggressive in pruning features up front, to scope things appropriately, and once committed to a file format some things just couldn't be added.

> The idea of using the tokenizer output and doing a direct match
> is intriguing. A full content scan is expensive (that is the
> point of indexing), but I guess this is usually less expensive
> than a full index scan for single rows (especially for large
> indexes), and would eliminate the current limitations.

Doing an fts index which can handle subset scans efficiently is going to be hard. Like a lot of systems, fts3 uses segments to keep index updates manageable, but this means that you can't just do a single b-tree intersection; you have to look at multiple b-trees, so you'll end up hitting a greater fraction of the index footprint to do the query.
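A quick sketch of why segments add cost, under the same assumptions as before (invented `segments` structure, not the real FTS3 layout): a term's postings can be spread across several segments, so a query must consult every segment's tree and merge the results, rather than doing one lookup in one tree:

```python
# Hedged illustration: each segment carries its own doclist for a
# term, so querying a term means touching every segment and merging.
import heapq

segments = [
    {"apple": [2, 9]},      # older segment
    {"apple": [5, 14]},     # newer segment from later updates
]

def merged_doclist(term):
    # One sorted-merge pass over every segment that mentions the term.
    lists = [seg[term] for seg in segments if term in seg]
    return list(heapq.merge(*lists))

print(merged_doclist("apple"))  # [2, 5, 9, 14]
```

This is the "greater fraction of the index footprint" point: the work scales with the number of live segments, not just with one tree's depth.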
You could get a CPU win by having the code at least not keep more of the doclist data than needed around. One thing I had been considering adding was some stats data so that you could easily determine the magnitude of the doclist for a term. In that case, if the stats suggested that the index wasn't much bigger than the subset of interest, use the index; otherwise use a content scan.

> Supposing someone wanted to update FTS3, how would they get
> write access to the main code repository?

That's for the SQLite team (I've been pretty quiet on that front, lately, so will not speak for them).

-scott
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users