> If you want the data for other purposes, you could almost implement a
> hack to get it. For instance, "SELECT count(docid) FROM fts_table
> WHERE fts_table MATCH 'x';" is kind of close, and you could change the
> match to handle prefix stuff. Unfortunately, the fts vtable cursor
> sets up a full table scan against the %_content table to satisfy
> fulltextColumn() requests. If the code could identify (or intuit) the
> case where you were just requesting the docid, then fulltextColumn()
> could return that directly from the index, which would be a LOT
> faster.
FWIW, new versions do that. If you only read the docid/rowid column, the
%_content table is not used by an FTS3 query.

Dan.

> -scott
>
> On Thu, Dec 24, 2009 at 12:30 AM, Max Vlasov <max.vla...@gmail.com> wrote:
>> Thanks, Scott,
>>
>> I see problems with regular tables, but to discuss the details I have to
>> make some tests comparing the approaches.
>>
>> I read the info about fts3 internals once again and would like to share
>> new thoughts. Let's forget about "hit count". Although I mentioned Google
>> suggestions, it was mainly to illustrate the principle of a popup hint
>> list; my primary interest is just in getting the list of tokens that
>> start with an entered string.
>>
>> So if fts3 data is stored in b-trees where a token is the key, there is a
>> similarity between how sqlite searches for a range of indexed varchars
>> and how fts3 searches for tokens. You mentioned that a problem exists
>> with merging results, but a general full-text search already has to merge
>> them for the found doclist, so the phrase "..Of course, this scheme makes
>> queries more expensive (as the FTS3 code may have to look up individual
>> terms in more than one b-tree and merge the results).." is applicable to
>> tokens as well. Moreover, I suppose that the code in fts3 that handles
>> queries with wildcards like Match("some*") can also be used for querying
>> tokens matching the same mask; the difference is probably that the former
>> has to merge doclists (from different segments/nodes) while the latter
>> merges the tokens themselves.
>>
>> One of the problems (if such queries can be implemented technically) is
>> the proper language syntax. As the tokens are not columns of the virtual
>> table, this syntax has to be different from a general SELECT.
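[A minimal sketch of the count(docid) hack discussed above, with the prefix
form Scott mentions. The table name, column name, and sample data are
illustrative, not from the thread; this assumes an SQLite build with FTS
enabled (FTS4 is used here, which includes the FTS3 query syntax).]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE fts_table USING fts4(body)")
conn.executemany("INSERT INTO fts_table(body) VALUES (?)",
                 [("x marks the spot",),
                  ("no match here",),
                  ("xylophone xenon",)])

# "Change the match to handle prefix stuff": 'x*' matches any token
# starting with "x". Reading only docid lets newer FTS versions answer
# from the full-text index without touching the %_content table.
(count,) = conn.execute(
    "SELECT count(docid) FROM fts_table WHERE fts_table MATCH 'x*'"
).fetchone()
print(count)  # two documents contain a token starting with "x"
```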
>>
>> On Wed, Dec 23, 2009 at 8:51 PM, Scott Hess <sh...@google.com> wrote:
>>
>>> Doing this in the tokenizer is probably not great because you can't
>>> determine which of the tokens actually are in the database after
>>> accounting for deletes and updates. When I was last thinking about
>>> this problem, I think the best option looked like keeping a distinct
>>> token index with token_id stored in the fts index. Another
>>> alternative was to rearrange how segments are stored and merged so
>>> that the token part of the index would float above the doclists.
>>>
>>> If you wanted to hack something in right now, you could keep a set of
>>> parallel segments mapping tokens to counts of the number of hits for
>>> that token. A document insert would generate a bunch of positive
>>> counts, a document delete a bunch of negative counts, merges would net
>>> them out, and if a count went to 0 that token would be dropped from
>>> that segment. Queries would still be somewhat costly, but this index
>>> could probably be merged more aggressively than the fts index (it
>>> should be much smaller).
>>>
>>> If your database isn't too large, you could maybe even keep this in a
>>> regular SQLite-style table. Maybe, maybe not; fts1 did something like
>>> that and it got terribly slow once the fts table had a few tens of
>>> thousands of documents. The problem was that the tokens were
>>> distributed across a large portion of the index, so data locality went
>>> down the drain and every update was like a full table scan.
>>>
>>> -scott
>>>
>>> On Wed, Dec 23, 2009 at 7:21 AM, Max Vlasov <max.vla...@gmail.com> wrote:
>>>> Scott, thanks,
>>>>
>>>> I see that hit counts are the next level of complexity.
>>>> As for tokens, I think that if one really needs to implement this
>>>> feature, there would be a possible solution in allowing the program
>>>> to "mirror" tokenized data.
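[Scott's parallel-segments idea above can be sketched roughly as follows.
This is not FTS internals, just an illustration of the bookkeeping: inserts
emit positive per-token counts, deletes emit negative ones, and a merge nets
segments together, dropping tokens whose count reaches zero. The tokenizer
and function names are hypothetical.]

```python
from collections import Counter

def tokenize(text):
    # Trivial stand-in for a real tokenizer.
    return text.lower().split()

def insert_segment(text):
    # A document insert generates positive counts per token.
    return Counter(tokenize(text))

def delete_segment(text):
    # A document delete generates the matching negative counts.
    return {t: -n for t, n in Counter(tokenize(text)).items()}

def merge(*segments):
    # Merging nets the counts out; tokens that reach 0 are dropped.
    total = Counter()
    for seg in segments:
        total.update(seg)
    return {t: n for t, n in total.items() if n != 0}

doc = "the quick fox"
index = merge(insert_segment(doc),
              insert_segment("quick tests"),
              delete_segment(doc))
print(index)  # only tokens from documents still present survive
```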
>>>> Right now, as I see it, it would be possible (if a custom tokenizer
>>>> is implemented) to collect such data inside the xNext function, but
>>>> there are problems. For example, we should know we're not parsing a
>>>> MATCH query. Is there a way to distinguish whether xNext is working
>>>> with MATCH operator content or with UPDATE or INSERT INTO content?
>>>> It makes no sense to collect data from MATCH, because it may contain
>>>> irrelevant, non-existing words (though ironically it would be helpful
>>>> for collecting "hit" data). If we know inside xNext that the call
>>>> comes from real data being appended, then at least we will be able
>>>> to maintain a dictionary of words used at least once.
>>>>
>>>> Max
>>>>
>>>> On Wed, Dec 23, 2009 at 5:54 PM, Scott Hess <sh...@google.com> wrote:
>>>>
>>>>> The tokens are all there, so it is "theoretically possible". But the
>>>>> performance would be very bad because it would require iterating over
>>>>> all the segment data, basically a worst-case query. Even then, you
>>>>> wouldn't have information about hit counts. To do it reasonably would
>>>>> require redesigning the data format to take this use-case into
>>>>> consideration.
>>>>>
>>>>> -scott
>>>>>
>>>>> On Wed, Dec 23, 2009 at 3:56 AM, Max Vlasov <max.vla...@gmail.com> wrote:
>>>>>> Is it theoretically possible to allow some kind of SQL SELECT query
>>>>>> operations with tokens?
>>>>>>
>>>>>> Particularly, one could implement suggestions similar to Google
>>>>>> suggestions in Web search, where for example "some" is typed in a
>>>>>> search box and a listbox pops up with this database's tokens starting
>>>>>> with these letters ("someone", "something" etc).
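[The tokens-table approach Max goes on to describe can be sketched like
this: tokens kept in an ordinary indexed table, so a prefix suggestion
becomes an indexed LIKE scan. The schema and data are illustrative; the
case_sensitive_like pragma is set so the index can serve the LIKE prefix.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# With case-sensitive LIKE, a "some%" prefix can use the Title index.
conn.execute("PRAGMA case_sensitive_like = ON")
conn.execute("CREATE TABLE Tokens (Title TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO Tokens VALUES (?)",
                 [("someone",), ("something",), ("sombrero",), ("other",)])

# Google-suggestions style lookup: tokens starting with the typed prefix.
suggestions = [row[0] for row in conn.execute(
    "SELECT Title FROM Tokens WHERE Title LIKE 'some%' ORDER BY Title")]
print(suggestions)  # ['someone', 'something']
```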
>>>>>> Having some experience with full-text search in the past, I once did
>>>>>> my own simple implementation of full-text for sqlite, where all
>>>>>> tokens were saved in a separate table that also has a text index, so
>>>>>> in this case SELECT Title FROM Tokens WHERE Title LIKE "some%" works
>>>>>> as expected (with the case-sensitivity pragma effectively set). I
>>>>>> tried to read the technical part of the
>>>>>> http://www.sqlite.org/fts3.html document, but could not figure out
>>>>>> whether it is possible to implement this in the current
>>>>>> implementation of fts3.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Max
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users