> Doing an fts index which can handle subset scans efficiently is going to be hard.
I noticed. After some thought, here's what I've come up with:

We'll call nT the number of terms and nD the number of docids in a given term. nTD is the number of rows in a natural join of terms and docids.

The current runtime to look up a term for a match is O(log nT) (right?). The current best possible runtime to look up a docid within a term is O(log nT) + O(nD). This is fine if nD is small, but rapidly becomes a major problem as nD gets larger.

If we doubled the size of the index, adding an extra tree to index terms and docids together, we could get O(log nTD) for a lookup. This is the ideal runtime (assuming we don't redesign the world and use a hash), but it requires extra code and a lot of extra space.

On the other hand, we could add a tree inside each segment to index the doclist. The term would be looked up as normal at a cost of O(log nT). After that, though, if the docid is known, it could be looked up at an additional cost of only O(log nD). The total cost of O(log nT) + O(log nD) is only marginally worse than O(log nTD) (and only because nTD is the count of a natural join, rather than a true product of the counts). The result is still pretty expensive for individual rows, but it is a whole lot better than it is now, and it avoids full scans.

This still doesn't offer direct access to the doclists by docid (you still need terms), but that problem should be easy to solve once the term + docid case is handled separately, because only the docid needs to be indexed at that point. I think the right way to do this is to have the doclist point back to the term it belongs to. Then a list of doclists could be stored with the regular data for each row (it is known at that point, so it requires no extra calculation).

These changes still require a data format change, but worst case that means incrementing the version. Does anyone see a reason why this wouldn't work?
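To make the proposal concrete, here's a minimal sketch of the two-level lookup being described. This is not FTS3's actual on-disk format; the sorted lists, `terms`/`doclists` names, and the `contains` helper are all invented for illustration, with binary search standing in for the b-tree descents:

```python
# Hypothetical sketch of the proposed lookup: a term index plus a
# per-term docid index, giving O(log nT) + O(log nD) per probe.
import bisect

# A sorted term list stands in for the term b-tree; each term maps to
# a sorted doclist, which the proposed per-segment tree would index.
terms = ["apple", "banana", "cherry"]
doclists = {
    "apple":  [2, 5, 9, 14],
    "banana": [1, 5, 7],
    "cherry": [3, 14],
}

def contains(term, docid):
    """Does `docid` appear in `term`'s doclist?"""
    i = bisect.bisect_left(terms, term)        # O(log nT) term lookup
    if i == len(terms) or terms[i] != term:
        return False
    dl = doclists[term]
    j = bisect.bisect_left(dl, docid)          # O(log nD) docid lookup
    return j < len(dl) and dl[j] == docid

print(contains("apple", 5))   # True
print(contains("apple", 6))   # False
```

Without the second bisect you'd be back to the current O(log nT) + O(nD) behavior of walking the whole doclist.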
John

-----Original Message-----
From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Scott Hess
Sent: Monday, October 19, 2009 12:51 PM
To: General Discussion of SQLite Database
Cc: punk...@eidesis.org
Subject: Re: [sqlite] Why FTS3 has the limitations it does

On Sat, Oct 17, 2009 at 1:25 PM, John Crenshaw <johncrens...@priacta.com> wrote:
> Agreed, HUGE thanks for FTS. Hopefully my original post didn't
> come off ungrateful. I was just confused by limitations that
> looked like they could have been removed during the initial
> design (at least more easily than they can now.) Scott's reply
> helps me understand this better, and perhaps gives some
> starting points for finding a solution.

One of the things I found challenging about fts development was that being embedded within SQLite made some problems harder. You can't just spin up a book-keeping thread to do stuff in the background, and you can't easily expose a grungy API to let the client do it, either. Plus you have the issues of shipping a framework (such as not being able to arbitrarily change the file format on a whim, even if it's WRONG). This meant that in many cases I was a bit aggressive in pruning features up front, to scope things appropriately, and once committed to a file format some things just couldn't be added.

> The idea of using the tokenizer output and doing a direct match
> is intriguing. A full content scan is expensive (that is the
> point of indexing), but I guess this is usually less expensive
> than a full index scan for single rows (especially for large
> indexes), and would eliminate the current limitations.

Doing an fts index which can handle subset scans efficiently is going to be hard. Like a lot of systems, fts3 uses segments to keep index updates manageable, but this means that you can't just do a single b-tree intersection; you have to look at multiple b-trees, so you'll end up hitting a greater fraction of the index footprint to do the query.
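A quick sketch of why segments add cost, under the same assumptions as before (invented `segments` structure, not the real FTS3 layout): a term's postings can be spread across several segments, so a query must consult every segment's tree and merge the results, rather than doing one lookup in one tree:

```python
# Hedged illustration: each segment carries its own doclist for a
# term, so querying a term means touching every segment and merging.
import heapq

segments = [
    {"apple": [2, 9]},      # older segment
    {"apple": [5, 14]},     # newer segment from later updates
]

def merged_doclist(term):
    # One sorted-merge pass over every segment that mentions the term.
    lists = [seg[term] for seg in segments if term in seg]
    return list(heapq.merge(*lists))

print(merged_doclist("apple"))  # [2, 5, 9, 14]
```

This is the "greater fraction of the index footprint" point: the work scales with the number of live segments, not just with one tree's depth.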
You could get a CPU win by having the code at least not keep more of the doclist data than needed around. One thing I had been considering adding was some stats data so that you could easily determine the magnitude of the doclist for a term. In that case, if the stats suggested that the index wasn't much bigger than the subset of interest, use the index; otherwise use a content scan.

> Supposing someone wanted to update FTS3, how would they get
> write access to the main code repository?

That's for the SQLite team (I've been pretty quiet on that front, lately, so will not speak for them).

-scott
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users