Re: [sqlite] What encoding format is used in the FTS3 tokenizer? and other tokenizer questions.

Scott Hess Thu, 21 May 2009 16:29:17 -0700

On Mon, May 4, 2009 at 2:22 PM, Garry Watkins <ga...@dynafocus.com> wrote:
> What encoding is used on inbound insert statements into a FTS3 virtual
> table?  For example, I have Japanese text encoded as UTF-8, and the
> insert statement it is passed in is also encoded as UTF-8.  I am not
> using the ICU library (SQLITE_ENABLE_ICU is not defined).


Should be UTF-8 across the board.

> In the above tokenizer I want to eliminate words (stemming), do I just
> not return those words, and move to the next?  Does this actually
> influence the text that is stored, or is this just used for indexing?

The original is stored as the document; the tokenizer results are used
to build the index and to construct queries.  So if you skip
stopwords, they won't be searchable using the fts MATCH operator.  It
should make the index smaller.  There may be query-time edge cases:
for instance, a phrase containing a stopword will probably match
occurrences of the phrase words in the input regardless of whether
there were stopwords between them, or what those stopwords were.
Probably not a huge problem in practice.  [I mean that if "a" and
"the" are stopwords, then "catch a ball" would match documents
containing "catch the ball".]
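
To make the stopword edge case concrete, here is a minimal standalone
sketch in C.  It does not use the SQLite API; next_token(),
same_index_terms() and the two-word stopword list are hypothetical
names made up for illustration, and the sketch assumes lowercase ASCII
stopwords.  It only shows that once stopwords are dropped, two
different phrases can reduce to the same sequence of index terms.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stopword list, for illustration only. */
static const char *const kStopwords[] = { "a", "the" };

/* Bytes >= 0x80 count as token bytes so that multi-byte UTF-8
 * sequences are never split (an assumption of this sketch). */
static int is_token_byte(unsigned char c) {
    return c >= 0x80 || isalnum(c);
}

static int is_stopword(const char *word, size_t len) {
    size_t i;
    for (i = 0; i < sizeof(kStopwords) / sizeof(kStopwords[0]); i++) {
        if (strlen(kStopwords[i]) == len &&
            memcmp(kStopwords[i], word, len) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Find the next non-stopword token in `text`, scanning from *cursor.
 * Returns 1 and sets *start / *end (byte offsets, end exclusive),
 * or returns 0 at end of input.  Stopwords are skipped silently. */
static int next_token(const char *text, size_t *cursor,
                      size_t *start, size_t *end) {
    size_t i = *cursor;
    for (;;) {
        while (text[i] != '\0' && !is_token_byte((unsigned char)text[i])) i++;
        if (text[i] == '\0') {
            *cursor = i;
            return 0;
        }
        *start = i;
        while (text[i] != '\0' && is_token_byte((unsigned char)text[i])) i++;
        *end = i;
        if (!is_stopword(text + *start, *end - *start)) {
            *cursor = i;
            return 1;
        }
    }
}

/* Returns 1 if both inputs produce the same token sequence once
 * stopwords are skipped, 0 otherwise. */
int same_index_terms(const char *a, const char *b) {
    size_t ca = 0, cb = 0, sa, ea, sb, eb;
    for (;;) {
        int have_a = next_token(a, &ca, &sa, &ea);
        int have_b = next_token(b, &cb, &sb, &eb);
        if (!have_a || !have_b) return have_a == have_b;
        if (ea - sa != eb - sb || memcmp(a + sa, b + sb, ea - sa) != 0) {
            return 0;
        }
    }
}
```

With this sketch, same_index_terms("catch a ball", "catch the ball")
reports a match, because both inputs reduce to the terms "catch" and
"ball".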

You would probably implement stemming by returning a token which is
different from the literal word in question.  That should work, too,
since the index is built from what the tokenizer returns, not from the
input.  You indicate the part of the input that the token represents
by the offsets the tokenizer returns by reference.  So the tokenizer
can return offsets corresponding to "corresponding", but return the
token "correspond", which isn't actually the same length, and
"correspond" will be what goes into the index.

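The split between the returned token and the returned offsets can be
sketched in standalone C as follows.  In FTS3 these values are
reported through the tokenizer module's xNext method; the sketch below
does not use that API.  StemmedToken, stem_word_at() and the toy
"strip a trailing ing" rule are hypothetical, chosen only to show a
token whose length differs from the span of input it represents.  A
real tokenizer would use a proper stemming algorithm.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
typedef struct {
    char   text[64];  /* term that would go into the index */
    size_t length;    /* number of bytes in text */
    size_t start;     /* byte offset of the original word in the input */
    size_t end;       /* byte offset one past the original word */
} StemmedToken;

/* Toy stemmer: drop a trailing "ing" from words longer than 5 bytes.
 * This is NOT a real stemming algorithm. */
static size_t toy_stem_length(const char *word, size_t len) {
    if (len > 5 && memcmp(word + len - 3, "ing", 3) == 0) {
        return len - 3;
    }
    return len;
}

/* Build the token for the word at input[start, end).  The offsets
 * keep describing the original word, while text/length describe the
 * stem.  Returns 1 on success, 0 if the word is invalid or does not
 * fit in the fixed-size buffer. */
int stem_word_at(const char *input, size_t start, size_t end,
                 StemmedToken *out) {
    size_t len, stem_len;
    if (end < start) return 0;
    len = end - start;
    if (len >= sizeof(out->text)) return 0;
    stem_len = toy_stem_length(input + start, len);
    memcpy(out->text, input + start, stem_len);
    out->text[stem_len] = '\0';
    out->length = stem_len;
    out->start = start;
    out->end = end;
    return 1;
}
```

For the input "a corresponding entry" and the word at byte offsets 2
to 15, the sketch yields the 10-byte token "correspond" while the
offsets still cover all 13 bytes of "corresponding".
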
I should note that I haven't worked in the code for a long time, so
things may have changed.

> I am building a static library, and I want the tokenizer to be
> available for anything that I link it with.  Is there an easy way to
> make that happen?  I currently configured it through adding code into
> the core FTS3 code.  I don't want to do this each time, and would like
> to keep the source separated, so I can easily change versions of SQLite.

Here, I'm a little wishy-washy.  For Google Gears, we just wired stuff
in via compile-time arguments; I think it was something like
-DSQLITE_ENABLE_FTS3.  (Actually, Gears uses fts2, which was safe
there because VACUUM was explicitly disabled so that fts2's design
flaw couldn't be triggered.  You should use fts3, but things should
work the same.)  Gears has third_party/sqlite_vendor and
third_party/sqlite_google, so you should be able to trivially verify
what changes Gears had from SQLite CVS as of the time things were last
imported.

Google Chrome also uses fts, but I'm not entirely sure of the
mechanics there, so you'll have to poke around on your own.

-scott
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
