Re: [sqlite] Why FTS3 has the limitations it does
On Mon, Oct 19, 2009 at 11:25 AM, John Crenshawwrote: > On the other hand, we could add a tree inside each segment to index the > doclist. The term would be looked up as normal at a cost of O(log nT). > After that though, if the docid is known, it could be looked up at an > additional cost of only O(log nD). The total cost O(log nT) + O(log nD) > is only marginally worse than O(log nTD) (and only because nTD is the > count of a natural join, rather than a true product of the counts.) The current code deals with variably-sized doclists by aggregating small ones into a leaf block, while large ones get their own leaf block, which can require overflow pages (at the SQLite level). One of my ideas for future development was to augment the index btree to store keys as term+docid. Large doclists could then be spread across multiple leaves and more efficiently accessed. There would be an extra byte per key in the interior nodes, plus there would be generally more info above the leaf nodes, but I think it would be pretty reasonable. I once had an off-list discussion with someone about more generally modeling the index as a sorted set of tuples, which can then be delta-encoded into leaf nodes and built up more-or-less as things currently work. That would allow leaf nodes to be uniformly-sized. It was a big change, though, in an area I wasn't enthusiastic about revisiting at the time. > This still doesn't offer direct access to the doc lists by docid (you > still need terms) but that problem should easy to solve once the term + > docid case is handled separately, because only the docid needs to be > indexed at that point. I think that this approach may not offer as much performance increase as one might like. Theoretically, you may read 1/10 as many pages for a particular query, but when you consider seeks and streaming, the gains may not be measurable when compared to a more stupid implementation (read the existing doclist and trim it based on inputs). Just really depends on things like fragmentation and the like. Which leads me to the next point. There's no gain to doing any of this is you cannot phrase the question in terms of existing SQLite structures. I'm not entirely certain that the current code has any facility for running MATCH against a set of rowids. I think you get one at a time. That's a big hit for any implementation, but if things come in sorted order it may be something that can be masked by caching some info in the cursor. You could try writing a MATCH function where the doclists needed are faulted in as needed, and each row trims the inputs to just that docid before executing the query. If things come in sorted order you can just open doclist-readers and scan once, so that part would drop from N^2 to N. I think doing this would be a reasonable starter project, because it could be done with no storage-format changes, but it would generate a familiarity with lots of bits of the code. And I think there's half a chance it would perform reasonably well for many cases. -scott ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Why FTS3 has the limitations it does
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Wanadoo Hartwig wrote: > It seems to be that ticket 3950 was also closed > (I actually opened this ticket using version 3.6.16). The reason for > closure was that the bug seems to have disappeared It is actually closed. I'm the one who has been going through all the old tickets and transferring the ones that still seem to be well defined existing bugs into the new ticket system and closing the rest directing further discussion to this mailing list to refine in detail what the issue is. http://www.sqlite.org/src/wiki?name=Bug+Reports This also means don't blame the SQLite dev team for the rather terse comments in some of them :-) I originally thought there weren't too many but only worked out yesterday that the report limits to 100. There would appear to be about 600 more to process in addition to however many hundred I already did! Roger -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkreHz8ACgkQmOOfHg372QQ0mgCeOaJaKlhvGU/+Czu9a/n4Ltik +ZsAoMKE1gqx13cbu9lfttR6KcYVqsZQ =1O+q -END PGP SIGNATURE- ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Why FTS3 has the limitations it does
Hi Scott, thanks for the reply. It seems to be that ticket 3950 was also closed (I actually opened this ticket using version 3.6.16). The reason for closure was that the bug seems to have disappeared in version 3.6.19. I have not checked it by myself as I have not installed 3.6.19, yet. I will report the outcome. Hartwig Am 19.10.2009 um 18:27 schrieb Scott Hess: > Here's a long-ago thread on this: > http://www.mail-archive.com/sqlite-users@sqlite.org/msg30540.html > Looks like it hasn't been addressed, and I've yet to come up for air > on it. > > There's a ticket out there which looks like the same thing: > http://www.sqlite.org/cvstrac/tktview?tn=3338 > which is closed, but it sounds like if you put up a simple repro-case > you might get a re-hearing on it. > > -scott > > > On Mon, Oct 19, 2009 at 9:13 AM, Wanadoo Hartwig >wrote: >> >> Am 18.10.2009 um 18:55 schrieb Roger Binns: >> >>> Wanadoo Hartwig wrote: Slightly different question but related to FTS3. Does anybody know why this fails using FTS3? >>> >>> It isn't failing. Behind the scenes FTS3 is implemented using 3 >>> other >>> tables (try .dump to see). You are indeed seeing the last inserted >>> rowid. >>> >> >> As the FTS related tables are modified by a trigger the last rowid >> should still be the one of the table "triggering the trigger" and not >> of any the tables being modified by the trigger. Isn't this the >> general idea of a trigger? >> >> Hartwig >> >>> Roger >>> ___ >>> sqlite-users mailing list >>> sqlite-users@sqlite.org >>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users >> >> ___ >> sqlite-users mailing list >> sqlite-users@sqlite.org >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users >> > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Why FTS3 has the limitations it does
> Doing an fts index which can handle subset scans efficiently is going to be hard. I noticed. After some thought, here's what I've come up with: We'll call nT the number of terms and nD the number of docids in a given term. nTD is the number of rows in a natural join of terms and docids. The current runtime to lookup a term for a match is O(log nT) (right?) The current best possible runtime to lookup a docid within a term is O(log nT) + O(nD). This is fine if nD is small, but rapidly becomes a major problem as nD gets larger. If we doubled the size of the index, adding an extra tree to index terms and docids together, we could get O(nTD) for a lookup. This is the ideal runtime (assuming we don't redesign the world and use a hash), but requires extra code, and a lot of extra space. On the other hand, we could add a tree inside each segment to index the doclist. The term would be looked up as normal at a cost of O(log nT). After that though, if the docid is known, it could be looked up at an additional cost of only O(log nD). The total cost O(log nT) + O(log nD) is only marginally worse than O(log nTD) (and only because nTD is the count of a natural join, rather than a true product of the counts.) The result is still pretty expensive for individual rows, but it is a whole lot better than it is now, and it avoids full scans. This still doesn't offer direct access to the doc lists by docid (you still need terms) but that problem should easy to solve once the term + docid case is handled separately, because only the docid needs to be indexed at that point. I think the right way to do this is to have the doclist point back to the term it belongs to. Then a list of doclists could be stored with the regular data for each row (it is known at that point, so requires no extra calculation.) These changes still require a data format change, but worst case that means incrementing the version. Does anyone see a reason why this wouldn't work? John -Original Message- From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Scott Hess Sent: Monday, October 19, 2009 12:51 PM To: General Discussion of SQLite Database Cc: punk...@eidesis.org Subject: Re: [sqlite] Why FTS3 has the limitations it does On Sat, Oct 17, 2009 at 1:25 PM, John Crenshaw <johncrens...@priacta.com> wrote: > Agreed, HUGE thanks for FTS. Hopefully my original post didn't > come off ungrateful. I was just confused by limitations that > looked like they could have been removed during the initial > design (at least more easily than they can now.) Scott's reply > helps me understand this better, and perhaps gives some > starting points for finding a solution. One of the things I found challenging about fts development was that being embedded w/in SQLite made some problems harder. You can't just spin up a book-keeping thread to do stuff in the background, and you can't easily expose a grungy API to let the client do it, either. Plus you have the issues of shipping a framework (such as not being able to arbitrarily change the file format on a whim, even if it's WRONG). This meant that in many cases I was a bit aggressive in pruning features up front, to scope things appropriately, and once committed to a file format some things just couldn't be added. > The idea of using the tokenizer output and doing a direct match > is intriguing. A full content scan is expensive (that is the > point of indexing,) but guess this is usually less expensive > than a full index scan for single rows (especially for large > indexes), and would eliminate the current limitations. Doing an fts index which can handle subset scans efficiently is going to be hard. Like a lot of systems fts3 uses segments to keep index updates manageable, but this means that you can't just do a single b-tree intersection, you have to look at multiple b-trees, so you'll end up hitting a greater fraction of the index footprint to do the query. You could get a CPU win by having the code at least not keep more of the doclist data than needed around. One thing I had been considering adding was some stats data so that you could easily determine the magnitude of the doclist for a term. In this case, if that info suggested that the index wasn't much bigger than the subset of interest, use the index, otherwise use a content scan. > Supposing someone wanted to update FTS3, how would they get > write access to the main code repository? That's for the SQLite team (I've been pretty quiet on that front, lately, so will not speak for them). -scott ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Why FTS3 has the limitations it does
On Sat, Oct 17, 2009 at 1:25 PM, John Crenshawwrote: > Agreed, HUGE thanks for FTS. Hopefully my original post didn't > come off ungrateful. I was just confused by limitations that > looked like they could have been removed during the initial > design (at least more easily than they can now.) Scott's reply > helps me understand this better, and perhaps gives some > starting points for finding a solution. One of the things I found challenging about fts development was that being embedded w/in SQLite made some problems harder. You can't just spin up a book-keeping thread to do stuff in the background, and you can't easily expose a grungy API to let the client do it, either. Plus you have the issues of shipping a framework (such as not being able to arbitrarily change the file format on a whim, even if it's WRONG). This meant that in many cases I was a bit aggressive in pruning features up front, to scope things appropriately, and once committed to a file format some things just couldn't be added. > The idea of using the tokenizer output and doing a direct match > is intriguing. A full content scan is expensive (that is the > point of indexing,) but guess this is usually less expensive > than a full index scan for single rows (especially for large > indexes), and would eliminate the current limitations. Doing an fts index which can handle subset scans efficiently is going to be hard. Like a lot of systems fts3 uses segments to keep index updates manageable, but this means that you can't just do a single b-tree intersection, you have to look at multiple b-trees, so you'll end up hitting a greater fraction of the index footprint to do the query. You could get a CPU win by having the code at least not keep more of the doclist data than needed around. One thing I had been considering adding was some stats data so that you could easily determine the magnitude of the doclist for a term. In this case, if that info suggested that the index wasn't much bigger than the subset of interest, use the index, otherwise use a content scan. > Supposing someone wanted to update FTS3, how would they get > write access to the main code repository? That's for the SQLite team (I've been pretty quiet on that front, lately, so will not speak for them). -scott ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Why FTS3 has the limitations it does
Here's a long-ago thread on this: http://www.mail-archive.com/sqlite-users@sqlite.org/msg30540.html Looks like it hasn't been addressed, and I've yet to come up for air on it. There's a ticket out there which looks like the same thing: http://www.sqlite.org/cvstrac/tktview?tn=3338 which is closed, but it sounds like if you put up a simple repro-case you might get a re-hearing on it. -scott On Mon, Oct 19, 2009 at 9:13 AM, Wanadoo Hartwigwrote: > > Am 18.10.2009 um 18:55 schrieb Roger Binns: > >> Wanadoo Hartwig wrote: >>> Slightly different question but related to FTS3. Does anybody know >>> why >>> this fails using FTS3? >> >> It isn't failing. Behind the scenes FTS3 is implemented using 3 other >> tables (try .dump to see). You are indeed seeing the last inserted >> rowid. >> > > As the FTS related tables are modified by a trigger the last rowid > should still be the one of the table "triggering the trigger" and not > of any the tables being modified by the trigger. Isn't this the > general idea of a trigger? > > Hartwig > >> Roger >> ___ >> sqlite-users mailing list >> sqlite-users@sqlite.org >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Why FTS3 has the limitations it does
Am 18.10.2009 um 18:55 schrieb Roger Binns: > Wanadoo Hartwig wrote: >> Slightly different question but related to FTS3. Does anybody know >> why >> this fails using FTS3? > > It isn't failing. Behind the scenes FTS3 is implemented using 3 other > tables (try .dump to see). You are indeed seeing the last inserted > rowid. > As the FTS related tables are modified by a trigger the last rowid should still be the one of the table "triggering the trigger" and not of any the tables being modified by the trigger. Isn't this the general idea of a trigger? Hartwig > Roger > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Why FTS3 has the limitations it does
Wanadoo Hartwig wrote: > Slightly different question but related to FTS3. Does anybody know why > this fails using FTS3? It isn't failing. Behind the scenes FTS3 is implemented using 3 other tables (try .dump to see). You are indeed seeing the last inserted rowid. Roger ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Why FTS3 has the limitations it does
Slightly different question but related to FTS3. Does anybody know why this fails using FTS3? CREATE TABLE Simple (ID integer primary key, Name text); CREATE VIRTUAL TABLE SimpleFTS USING FTS3 (Name); CREATE TRIGGER DeleteTrigger AFTER DELETE ON Simple FOR EACH ROW BEGIN DELETE FROM SimpleFTS WHERE (rowid=OLD.ID); END; CREATE TRIGGER InsertTrigger AFTER INSERT ON Simple FOR EACH ROW BEGIN INSERT INTO SimpleFTS (rowid,Name) VALUES(NEW.ID,NEW.Name); END; INSERT INTO Simple (Name) VALUES('one'); INSERT INTO Simple (Name) VALUES('two'); DELETE FROM Simple WHERE (ID = 1); INSERT INTO Simple (Name) VALUES('three'); SELECT * FROM Simple; SELECT last_insert_rowid() FROM Simple; The output is: 2|two 3|three 4 <-- BUG?! 4 4 Hartwig PS: This fails only with FTS3. If you use any other (virtual) table it works. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Why FTS3 has the limitations it does
Agreed, HUGE thanks for FTS. Hopefully my original post didn't come off ungrateful. I was just confused by limitations that looked like they could have been removed during the initial design (at least more easily than they can now.) Scott's reply helps me understand this better, and perhaps gives some starting points for finding a solution. The idea of using the tokenizer output and doing a direct match is intriguing. A full content scan is expensive (that is the point of indexing,) but guess this is usually less expensive than a full index scan for single rows (especially for large indexes), and would eliminate the current limitations. As far as continued development, there is a "tracker FTS" branch available that appears to be active. See http://git.gnome.org/cgit/tracker/tree/src/libtracker-fts. It looks like there is also continued active development on it: http://git.gnome.org/cgit/tracker/log/?qt=grep=FTS. The tracker-fts code adds ranking and some other important functionality, but it is hard to separate from the rest of tracker. The tracker-fts files are public domain (SQLite license) but they have some dependencies on other parts of tracker that are not. Also, at least as of a few months ago, I think they were based on an earlier version of FTS3. Supposing someone wanted to update FTS3, how would they get write access to the main code repository? John -Original Message- From: sqlite-users-boun...@sqlite.org [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of P Kishor Sent: Friday, October 16, 2009 4:23 PM To: General Discussion of SQLite Database Subject: Re: [sqlite] Why FTS3 has the limitations it does On Fri, Oct 16, 2009 at 3:12 PM, Scott Hess <sh...@google.com> wrote: > On Wed, Oct 14, 2009 at 11:35 PM, John Crenshaw > <johncrens...@priacta.com> wrote: >> The severe limitations on FTS3 seemed odd to me, but I figured I could >> live with them. Then I starting finding that various queries were giving >> strange "out of context" errors with the MATCH operator, even though I >> was following all the documented rules. As a result I started looking >> deeply into what is going on with FTS3 and I found something that >> bothers me. >> >> These limitations are really completely arbitrary. They should be >> removable. > > fts is mostly the way it is because that was the amount that got done > before I lost the motivation to carry it further. The set of possible > improvements is vast, but they need a motivated party to carry them > forward. Some of the integration with SQLite is the way it is mostly > because it was decided to keep fts outside of SQLite core. Feel free > to dive in and improve it. > >> You can only use a single index to query a table, after that everything >> else has to be done with a scan of the results, fair enough. But with >> FTS3, the match operator works ONLY when the match expression is >> selected for the index. This means that if a query could allow a row to >> be selected by either rowid, or a MATCH expression, you can have a >> problem. If the rowid is selected for use as the index, the MATCH won't >> be used as the index, and you get errors. Similarly, a query with two >> MATCH expressions will only be able to use one as the index, so you get >> errors from the second. > > The MATCH code probes term->doclist, there is no facility for probing > by docid. At minimum the document will need to be tokenized. > Worst-case, you could tokenize it to an in-memory segment and probe > that, which would make good re-use of existing code. Most efficient > would be to somehow match directly against the tokenizer output (you > could look at the snippeting code for hints there). > >> My first question is, why was FTS designed like this in the first place? > > Because running MATCH against a subset of the table was not considered > an important use case when designing it? > >> Surely this was clear during the design stage, when the design could >> have been easily changed to accommodate the lookups required for a MATCH >> function. Is there some compelling performance benefit? Something I >> missed? > > "Easily" is all relative. There were plenty of hard problems to be > solved without looking around for a bunch of easy ones to tack on. > >> My second question is, can we expect this to change at some point? > > Probably not unless someone out there decides to. I got kind of > burned out on fts about a year back. With immense gratitude expressed here to Scott, I feel a bit disappointed that FTS has fallen out of the core, and out of "continued development and improvement." It is really a brilliant piece of work that makes sqlite eminently more usable for a number
Re: [sqlite] Why FTS3 has the limitations it does
On Fri, Oct 16, 2009 at 3:12 PM, Scott Hesswrote: > On Wed, Oct 14, 2009 at 11:35 PM, John Crenshaw > wrote: >> The severe limitations on FTS3 seemed odd to me, but I figured I could >> live with them. Then I starting finding that various queries were giving >> strange "out of context" errors with the MATCH operator, even though I >> was following all the documented rules. As a result I started looking >> deeply into what is going on with FTS3 and I found something that >> bothers me. >> >> These limitations are really completely arbitrary. They should be >> removable. > > fts is mostly the way it is because that was the amount that got done > before I lost the motivation to carry it further. The set of possible > improvements is vast, but they need a motivated party to carry them > forward. Some of the integration with SQLite is the way it is mostly > because it was decided to keep fts outside of SQLite core. Feel free > to dive in and improve it. > >> You can only use a single index to query a table, after that everything >> else has to be done with a scan of the results, fair enough. But with >> FTS3, the match operator works ONLY when the match expression is >> selected for the index. This means that if a query could allow a row to >> be selected by either rowid, or a MATCH expression, you can have a >> problem. If the rowid is selected for use as the index, the MATCH won't >> be used as the index, and you get errors. Similarly, a query with two >> MATCH expressions will only be able to use one as the index, so you get >> errors from the second. > > The MATCH code probes term->doclist, there is no facility for probing > by docid. At minimum the document will need to be tokenized. > Worst-case, you could tokenize it to an in-memory segment and probe > that, which would make good re-use of existing code. Most efficient > would be to somehow match directly against the tokenizer output (you > could look at the snippeting code for hints there). > >> My first question is, why was FTS designed like this in the first place? > > Because running MATCH against a subset of the table was not considered > an important use case when designing it? > >> Surely this was clear during the design stage, when the design could >> have been easily changed to accommodate the lookups required for a MATCH >> function. Is there some compelling performance benefit? Something I >> missed? > > "Easily" is all relative. There were plenty of hard problems to be > solved without looking around for a bunch of easy ones to tack on. > >> My second question is, can we expect this to change at some point? > > Probably not unless someone out there decides to. I got kind of > burned out on fts about a year back. With immense gratitude expressed here to Scott, I feel a bit disappointed that FTS has fallen out of the core, and out of "continued development and improvement." It is really a brilliant piece of work that makes sqlite eminently more usable for a number of tasks. I was setting up a "tagging" system, and what a nightmare that was until I realized that I don't have to develop one. I can just use FTS! It sort of fits right in Google's philosophy that I summarized in my write-up a while back http://punkish.org/Why-File-When-You-Can-Full-Text-Search I don't have the smarts to work on FTS, only the smarts to realize that it is a great thing to use. I hope someone will pick it up and that eventually we will have FTS4. Makes life so much easier instead of dicking around with external search mechanisms such as Lucene or swish-e or htdig, etc. > >> All that is needed is the ability to lookup by a combination >> of docid and term. Isn't a hash already built while creating a list of >> terms for storage? What if that hash were stored, indexed by docid? > > In database world, space==time. Storing more data means the system gets > slower. > > -scott > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > -- Puneet Kishor http://www.punkish.org Carbon Model http://carbonmodel.org Charter Member, Open Source Geospatial Foundation http://www.osgeo.org Science Commons Fellow, http://sciencecommons.org/about/whoweare/kishor Nelson Institute, UW-Madison http://www.nelson.wisc.edu --- Assertions are politics; backing up assertions with evidence is science === Sent from Madison, WI, United States ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users