Doing this in the tokenizer is probably not great because you can't
determine which of the tokens actually are in the database after
accounting for deletes and updates.  When I was last thinking about
this problem, I think the best option looked like keeping a distinct
token index with token_id stored in the fts index.  Another
alternative was to rearrange how segments are stored and merged so
that the token part of the index would float above the doclists.

If you wanted to hack something in right now, you could keep a set of
parallel segments mapping tokens to counts of the number of hits for
that token.  A document insert would generate a bunch of positive
counts, a document delete a bunch of negative counts, merges would net
them out, and if a count went to 0 that token would be dropped from
that segment.  Queries would still be somewhat costly, but this index
could probably be merged more aggressively than the fts index (it
should be much smaller).

If your database isn't too large, you could maybe even keep this in a
regular SQLite-style table.  Maybe, maybe not, fts1 did something like
that and it got terribly slow once the fts table had a few tens of
thousands of documents.  The problem was that the tokens were
distributed across a large portion of the index, so data locality went
down the drain and every update was like a full table scan.

-scott


On Wed, Dec 23, 2009 at 7:21 AM, Max Vlasov <max.vla...@gmail.com> wrote:
> Scott, thank,
>
> I see that the hit counts is the next level of complexity.
> As for tokens, I think that if one really needs to implement this feature,
> there would be a possible solution in allowing the program to "mirror"
> tokenized" data. Right now as I see it would be possible (if a custom
> tokenizer implemented), to collect such data inside xNext function, but
> there are problems. For example, we should know we're not parsing Match
> query. Is there a way to distinguish  whether xNext works with MATCH
> operator content or from UPDATE or INSERT INTO? Because it makes no sense to
> collect data from MATCH, because it may contain irrelevant, non-exising
> words (but ironically it would be helpful for collecting "hit" data). If we
> know inside xNext that the call from a real data appending then at least we
> will be able to maintain a dictionary of words used at least once.
>
> Max
>
> On Wed, Dec 23, 2009 at 5:54 PM, Scott Hess <sh...@google.com> wrote:
>
>> The tokens are all there, so it is "theoretically possible".  But the
>> performance would be very bad because it would require iterating over
>> all the segment data, basically a worst-case query.  Even then, you
>> wouldn't have information about hit counts.  To do it reasonably would
>> require redesigning the data format to take this use-case into
>> consideration.
>>
>> -scott
>>
>> On Wed, Dec 23, 2009 at 3:56 AM, Max Vlasov <max.vla...@gmail.com> wrote:
>> > Is it theoretically possible to allow some kind of sql SELECT query
>> > operations with tokens?
>> >
>> > Particulary one could implement suggestions similar to Google suggestions
>> in
>> > Web search, when for example "some" typed in a search box and a listbox
>> > popped up with this database tokens starting with these letters shown
>> > ("someone", "something" etc).
>> > Having some experience with full-text in past once I did my own simple
>> > implementation of full-text for sqlite, where all tokens saved in a
>> > separated table that also has a text index, so in this case SELECT Title
>> > FROM Tokens WHERE Title LIKE "some%" works as expected (with
>> > case-sensitivity pragma effectively set). I tried to read the technical
>> part
>> > of http://www.sqlite.org/fts3.html document, but could not figure out
>> > whether it is possible to implement this in the current implementation of
>> > fts3
>> >
>> > Thanks
>> >
>> > Max
>> > _______________________________________________
>> > sqlite-users mailing list
>> > sqlite-users@sqlite.org
>> > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>> >
>> _______________________________________________
>> sqlite-users mailing list
>> sqlite-users@sqlite.org
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to