Re: [sqlite] Diacritics handling in FTS with a custom tokenizer

Dan Kennedy Wed, 08 Feb 2012 08:44:47 -0800

On 02/08/2012 11:34 PM, George Ionescu wrote:

Hello all,
I would like to know how diacritics are handled in FTS, specifically whether
I can index text with diacritics and search for terms without them.


For example, given the queries

CREATE VIRTUAL TABLE fts_pages USING fts4(tokenize=snowball ro_RO);
INSERT INTO fts_pages (docid,content) VALUES (1, 'România este o ţară frumoasă');

the search
SELECT COUNT(1) FROM fts_pages WHERE content MATCH 'este'
returns 1,

but the next search
SELECT COUNT(1) FROM fts_pages WHERE content MATCH 'Romania'
returns 0.

The tokenizer I'm using is based on snowball and can be found at
https://bitbucket.org/sevkin/snowball_fts3


The custom tokenizer needs to normalize the tokens. So when it
parses "România" it should return "romania" (with no diacritic)
to FTS. Then when you query for "romania", it will match.

Note that the custom tokenizer is used to tokenize queries as
well as documents. So if I query for "România", the tokenizer
will normalize the query term to "romania" too, which will
match the normalized entry in the index.
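
To illustrate the normalization step described above, here is a minimal
sketch in Python. It is not the snowball tokenizer and not the FTS
tokenizer interface (a real tokenizer is written in C and registered with
SQLite); the helper names fold_token and tokenize are hypothetical. It
assumes that decomposing each token to Unicode NFD, dropping the combining
marks, and lowercasing is an acceptable way to strip diacritics for your
language.

```python
import unicodedata


def fold_token(token):
    """Return the token lowercased with combining marks removed.

    Hypothetical helper: NFD splits a character such as "â" into the
    base letter "a" plus a combining accent, which is then dropped.
    """
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def tokenize(text):
    """Split on whitespace and fold every token (illustrative only)."""
    return [fold_token(word) for word in text.split()]


# The same function is applied to documents and to query terms,
# so both sides end up in the same normalized form.
document_tokens = tokenize("România este o ţară frumoasă")
query_tokens = tokenize("Romania")

print(document_tokens)                      # ['romania', 'este', 'o', 'tara', 'frumoasa']
print(query_tokens[0] in document_tokens)   # True
```

Because the folding is applied on both sides, a query for "România" and a
query for "Romania" produce the same term and match the same index entry.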

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
