[sqlite] Opinions about per-row tokenizers for fts?

Scott Hess Mon, 17 Sep 2007 15:47:52 -0700

As part of doing internationalization work on Gears, it has become
clear that you probably cannot just define one global tokenizer that
will work for everything.  Instead, in some cases you
may need to use a specific tokenizer, based on the content being
tokenized, or the source of the content.  This can be emulated by
using multiple tables and complicating your joins, but it would be
nicer if fts could just accommodate this use case.
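
For reference, a minimal sketch of the multiple-table emulation, using
Python's sqlite3 module.  It assumes an SQLite build with FTS3 enabled.
The built-in 'simple' and 'porter' tokenizers stand in for
per-language tokenizers such as icu(en), and the table names and the
insert()/search() helpers are hypothetical, made up for illustration.

```python
import sqlite3

# Assumption: the linked SQLite library was compiled with FTS3 support.
conn = sqlite3.connect(":memory:")

# One fts3 table per tokenizer; 'simple' and 'porter' are stand-ins for
# whatever per-language tokenizers you would really want.
conn.executescript("""
    CREATE VIRTUAL TABLE t_simple USING fts3(content, tokenize=simple);
    CREATE VIRTUAL TABLE t_porter USING fts3(content, tokenize=porter);
""")

# Hypothetical mapping from a tokenizer label to the table that uses it.
TABLES = {"simple": "t_simple", "porter": "t_porter"}


def insert(tokenizer, text):
    """Route a row to the table whose tokenizer should index it."""
    table = TABLES[tokenizer]  # raises KeyError for an unknown label
    conn.execute(f"INSERT INTO {table} (content) VALUES (?)", (text,))


def search(query):
    """Run the same MATCH against every table and combine the hits."""
    sql = " UNION ALL ".join(
        f"SELECT '{name}' AS tokenizer, rowid, content "
        f"FROM {table} WHERE {table} MATCH ?"
        for name, table in TABLES.items()
    )
    return conn.execute(sql, [query] * len(TABLES)).fetchall()


insert("simple", "testing testing")
insert("porter", "testing testing")
```

Note that rowids are only unique within each table, so callers have to
carry the tokenizer label along with the rowid, and every query has to
fan out over all of the tables.  That is the complication a per-row
tokenizer would avoid.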


In the interests of not committing something that people won't like,
my current proposal would be to add an implicit TOKENIZER column,
which will override the table's default tokenizer for that row.  So,
you could do something like:

   CREATE VIRTUAL TABLE t USING fts3(TOKENIZER icu(en), content);
   INSERT INTO t VALUES ('testing testing');  -- Uses icu(en).
   INSERT INTO t (tokenizer, content) VALUES ('icu(kr)', '나의');
     -- Uses icu(kr).
   SELECT rowid FROM t WHERE t MATCH 'TOKENIZER:icu(kr) 의';

[Forgive me if you can read Korean and I just did something offensive.
 I'm doing copy/paste, here!]

fts accepts anything starting with 'tokenize' in that position in
the CREATE statement, so every later use must match the spelling used
there.  If you used "TOKENIZE" in the create, you use "TOKENIZE"
everywhere else.  In MATCH, it must be the uppercase form of the term
used in the create (the other places are case-insensitive), followed
by ':', followed by a valid tokenizer name, followed by an optional
parameter list.
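
To make the MATCH rule concrete, here is a rough sketch in Python of
how such a prefix could be split off a query string.  None of this is
existing fts code: split_tokenizer_prefix() is a hypothetical helper,
and it assumes the prefix comes first in the query, that tokenizer
names are plain identifiers, and that parameters are comma-separated.

```python
import re


def split_tokenizer_prefix(query, keyword="TOKENIZER"):
    """Split 'KEYWORD:name(params) rest' into (name, params, rest).

    `keyword` is the term used in the CREATE statement; in MATCH only
    its uppercase form is accepted.  Returns (None, [], query) when the
    query carries no tokenizer prefix.
    """
    pattern = re.compile(
        re.escape(keyword.upper())   # uppercase keyword, matched exactly
        + r":(\w+)"                  # ':' then the tokenizer name
        + r"(?:\(([^)]*)\))?"        # optional '(param, param, ...)'
        + r"\s+(.*)",                # whitespace, then the real query
        re.DOTALL,
    )
    m = pattern.fullmatch(query)
    if m is None:
        return None, [], query
    name, params, rest = m.groups()
    args = [p.strip() for p in params.split(",")] if params else []
    return name, args, rest
```

With the example above, 'TOKENIZER:icu(kr) 의' would split into the
tokenizer name 'icu', the parameter list ['kr'], and the remaining
query '의', while a lowercase 'tokenizer:' prefix would be left alone
and treated as ordinary query text.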

A variant which I think is somewhat interesting would be:

  CREATE VIRTUAL TABLE t USING fts3(
    tokenizer TOKENIZER DEFAULT icu(en), content);

This makes the "tokenizer" column a bit more explicit, and the
'DEFAULT ...' syntax makes it clear what's going on, but I couldn't
really think of any other sensible name for the column, so it also
feels redundant.  Since 'tokenize' is already a reserved prefix for
fts, I'm inclined towards the first variant.

Opinions?

-scott
