> If you want the data for other purposes, you could almost implement a
> hack to get it.  For instance, "SELECT count(docid) FROM fts_table
> WHERE fts_table MATCH 'x';" is kind of close, and you could change the
> match to handle prefix stuff.  Unfortunately, the fts vtable cursor
> sets up a full table scan against the %_content table to satisfy
> fulltextColumn() requests.  If the code could identify (or intuit) the
> case where you were just requesting the docid, then fulltextColumn()
> could return that directly from the index, which would be a LOT
> faster.

FWIW, new versions do that. If you only read the docid/rowid column,
the %_content table is not used by an FTS3 query.

Dan.
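The docid-only query described above can be sketched from Python's bundled
sqlite3 module. A minimal sketch, assuming a build with FTS4 compiled in
(the table name and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE ft USING fts4(body)")
conn.executemany("INSERT INTO ft(body) VALUES (?)",
                 [("x marks the spot",), ("no match here",), ("x again",)])

# Only docid is read here, so newer FTS versions can answer this from
# the full-text index without touching the %_content shadow table.
(count,) = conn.execute(
    "SELECT count(docid) FROM ft WHERE ft MATCH 'x'").fetchone()
print(count)  # 2
```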



>
> -scott
>
>
> On Thu, Dec 24, 2009 at 12:30 AM, Max Vlasov <max.vla...@gmail.com> wrote:
>> Thanks, Scott,
>>
>> I see problems with regular tables, but to discuss the details I have
>> to run some tests comparing the approaches.
>>
>> I read the info about fts3 internals once again and would like to
>> share some new thoughts.
>> Let's forget about "hit count". Although I mentioned Google
>> suggestions, it was mainly to illustrate the principle of a popup hint
>> list; my primary interest is just in getting the list of tokens that
>> start with an entered string.
>>
>> So if fts3 data is stored in b-trees, where a token is the key, there
>> is a similarity between how sqlite searches for a range of indexed
>> varchars and how fts3 searches for tokens. You mentioned that a
>> problem exists with merging results, but general full-text search
>> already has to merge them for the found doclist, so the phrase "..Of
>> course, this scheme makes queries more expensive (as the FTS3 code may
>> have to look up individual terms in more than one b-tree and merge the
>> results).." is applicable to tokens as well.
>> Moreover, I suppose that the code in fts3 that handles queries with
>> wildcards like Match("some*") could also be used for querying tokens
>> matching the same mask; the difference is probably that the former
>> merges doclists (from different segments/nodes) while the latter
>> merges the tokens themselves.
>>
>> One of the problems (if such queries can be implemented technically)
>> is the proper language syntax. As the tokens are not columns of the
>> virtual table, this syntax has to be different from a general SELECT.
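For what it's worth, later SQLite versions added the fts4aux module, which
exposes the full-text index's terms as a virtual table and supports exactly
this kind of prefix enumeration. A minimal sketch, assuming a build with
FTS4 and fts4aux compiled in (all names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE ft USING fts4(body)")
conn.executemany("INSERT INTO ft(body) VALUES (?)",
                 [("someone said something",), ("somebody else",)])

# fts4aux exposes one row per (term, column) of ft's full-text index.
conn.execute("CREATE VIRTUAL TABLE ft_terms USING fts4aux(ft)")

# Prefix scan "some*": half-open range [some, somf) over the term column.
prefix = "some"
upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
terms = [r[0] for r in conn.execute(
    "SELECT DISTINCT term FROM ft_terms "
    "WHERE term >= ? AND term < ? ORDER BY term", (prefix, upper))]
print(terms)  # ['somebody', 'someone', 'something']
```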
>>
>> On Wed, Dec 23, 2009 at 8:51 PM, Scott Hess <sh...@google.com> wrote:
>>
>>> Doing this in the tokenizer is probably not great because you can't
>>> determine which of the tokens actually are in the database after
>>> accounting for deletes and updates.  When I was last thinking about
>>> this problem, I think the best option looked like keeping a distinct
>>> token index with token_id stored in the fts index.  Another
>>> alternative was to rearrange how segments are stored and merged so
>>> that the token part of the index would float above the doclists.
>>>
>>> If you wanted to hack something in right now, you could keep a set
>>> of parallel segments mapping tokens to counts of the number of hits
>>> for that token.  A document insert would generate a bunch of
>>> positive counts, a document delete a bunch of negative counts,
>>> merges would net them out, and if a count went to 0 that token would
>>> be dropped from that segment.  Queries would still be somewhat
>>> costly, but this index could probably be merged more aggressively
>>> than the fts index (it should be much smaller).
>>>
>>> If your database isn't too large, you could maybe even keep this in
>>> a regular SQLite-style table.  Maybe, maybe not: fts1 did something
>>> like that, and it got terribly slow once the fts table had a few
>>> tens of thousands of documents.  The problem was that the tokens
>>> were distributed across a large portion of the index, so data
>>> locality went down the drain and every update was like a full table
>>> scan.
>>>
>>> -scott
>>>
>>>
>>> On Wed, Dec 23, 2009 at 7:21 AM, Max Vlasov <max.vla...@gmail.com> wrote:
>>>> Scott, thanks,
>>>>
>>>> I see that hit counts are the next level of complexity.
>>>> As for tokens, I think that if one really needs to implement this
>>>> feature, a possible solution would be to allow the program to
>>>> "mirror" the tokenized data. Right now, as I see it, it would be
>>>> possible (if a custom tokenizer is implemented) to collect such
>>>> data inside the xNext function, but there are problems. For
>>>> example, we should know we're not parsing a Match query. Is there a
>>>> way to distinguish whether xNext is working with MATCH operator
>>>> content or with content from UPDATE or INSERT INTO? It makes no
>>>> sense to collect data from MATCH, because it may contain
>>>> irrelevant, non-existing words (though ironically that would be
>>>> helpful for collecting "hit" data). If we know inside xNext that
>>>> the call comes from real data being appended, then at least we will
>>>> be able to maintain a dictionary of words used at least once.
>>>>
>>>> Max
>>>>
>>>> On Wed, Dec 23, 2009 at 5:54 PM, Scott Hess <sh...@google.com> wrote:
>>>>
>>>>> The tokens are all there, so it is "theoretically possible".  But
>>>>> the performance would be very bad, because it would require
>>>>> iterating over all the segment data, basically a worst-case query.
>>>>> Even then, you wouldn't have information about hit counts.  To do
>>>>> it reasonably would require redesigning the data format to take
>>>>> this use-case into consideration.
>>>>>
>>>>> -scott
>>>>>
>>>>> On Wed, Dec 23, 2009 at 3:56 AM, Max Vlasov <max.vla...@gmail.com> wrote:
>>>>>> Is it theoretically possible to allow some kind of SQL SELECT
>>>>>> query operations on tokens?
>>>>>>
>>>>>> In particular, one could implement suggestions similar to Google
>>>>>> suggestions in Web search, where for example "some" typed in a
>>>>>> search box pops up a listbox with this database's tokens starting
>>>>>> with these letters ("someone", "something" etc).
>>>>>> Having some experience with full-text search in the past, I once
>>>>>> did my own simple implementation of full-text search for sqlite,
>>>>>> where all tokens were saved in a separate table that also has a
>>>>>> text index, so in this case SELECT Title FROM Tokens WHERE Title
>>>>>> LIKE "some%" works as expected (with the case-sensitivity pragma
>>>>>> effectively set). I tried to read the technical part of the
>>>>>> http://www.sqlite.org/fts3.html document, but could not figure
>>>>>> out whether it is possible to implement this in the current
>>>>>> implementation of fts3.
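The separate-token-table approach described above can be sketched against a
plain sqlite table. The `Tokens`/`Title` names follow the message;
everything else is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOCASE collation keeps comparisons case-insensitive and lets the
# implicit UNIQUE index serve the LIKE prefix scan.
conn.execute("CREATE TABLE Tokens (Title TEXT UNIQUE COLLATE NOCASE)")
conn.executemany("INSERT INTO Tokens(Title) VALUES (?)",
                 [("someone",), ("something",), ("other",)])

hits = [r[0] for r in conn.execute(
    "SELECT Title FROM Tokens WHERE Title LIKE 'some%' ORDER BY Title")]
print(hits)  # ['someone', 'something']
```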
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Max
>>>>>> _______________________________________________
>>>>>> sqlite-users mailing list
>>>>>> sqlite-users@sqlite.org
>>>>>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>>>>>
