Hi,

The FTS simple tokenizer has an undocumented feature that allows the set
of characters it treats as delimiters to be configured.

By default it simply treats all non-alphanumeric ASCII characters as
delimiters, but the following example shows how it can be customized to use
only '#' (hash) or ' ' (space) as delimiters:

CREATE VIRTUAL TABLE documents USING fts4(title, content,
    tokenize=simple '' '# ');

Above, the first argument is an empty string (the simple tokenizer ignores
its first argument), and the second argument is the list of delimiters to use.
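
To make the difference from the default behaviour concrete, here is a
minimal sketch using Python's built-in sqlite3 module. It assumes the
underlying SQLite library was compiled with FTS4 enabled. The table name
documents_default, the helper count_matches and the sample row are made up
for illustration; only the CREATE VIRTUAL TABLE statement for documents
comes from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Table from the example above: only '#' and ' ' act as delimiters.
conn.execute(
    "CREATE VIRTUAL TABLE documents "
    "USING fts4(title, content, tokenize=simple '' '# ')"
)
# Hypothetical comparison table using the default delimiter set
# (all non-alphanumeric ASCII characters).
conn.execute(
    "CREATE VIRTUAL TABLE documents_default "
    "USING fts4(title, content, tokenize=simple)"
)

# Same sample row in both tables. With the custom delimiters the content
# should split into "hello-world", "foo" and "bar"; with the defaults,
# '-' is a delimiter too, so "hello" and "world" become separate tokens.
for table in ("documents", "documents_default"):
    conn.execute(
        f"INSERT INTO {table}(title, content) VALUES (?, ?)",
        ("note", "hello-world#foo bar"),
    )


def count_matches(table, term):
    """Return the number of rows in `table` whose text matches `term`."""
    sql = f"SELECT count(*) FROM {table} WHERE {table} MATCH ?"
    return conn.execute(sql, (term,)).fetchone()[0]


for table in ("documents", "documents_default"):
    for term in ("foo", "hello"):
        print(table, term, count_matches(table, term))
```

If the feature behaves as described, a search for "hello" should find the
row only in documents_default, because in documents the hyphen is not a
delimiter and "hello-world" is indexed as a single token.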

There was a brief discussion on this list about the feature in 2012 [1].

Quote (regarding lack of documentation):
"Likely the reason is that we forgot that this feature even exists.  It
seems to have existed in the simple tokenizer, unchanged, since the
original introduction of FTS1 back in 2006."

Quote (regarding whether it's safe to use the feature):
"But it has been in the code for so long now that we dare not
change it for fear of breaking long-established programs."

However, it was also mentioned that the feature has probably not been
tested thoroughly.

Nonetheless, the relevant source code looks fairly straightforward [2].

In a current project we are doing tokenization outside of SQLite, so the
ability to tell SQLite which delimiter character we have used, without
needing to import ICU etc., is very appealing.

Can we document this hidden feature?

Best regards,
Niall Gallagher

[1] Previous discussion:
http://article.gmane.org/gmane.comp.db.sqlite.general/74199
[2] Source code:
http://www.sqlite.org/src/artifact/5c98225a53705e5ee34824087478cf477bdb7004
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
