Re: [GENERAL] TSearch2 / Get all unique lexems

Teodor Sigaev Thu, 08 Dec 2005 03:01:51 -0800

Thanks. I hoped for something possible inside a pl/pgsql proc. I'm trying to integrate pg_trgm with Tsearch2. I'm still on my UTF-8 database. Yes, I know, there is _NO_ UTF-8 support of any kind in Tsearch2 yet, but I got it working to a degree that is OK for my application (created my own stemmer variant, ispell dict, affix file, etc). The last missing bit is to get a source for pg_trgm. I cannot use the stat() function, because it breaks as soon as it sees a UTF-8 char.

I suppose a word parser that is not UTF-compatible can produce illegal lexemes (containing part of a multibyte char) and store them in the tsvector. Tsvector has no check for broken lexemes (for example, with the help of a pg_verifymbstr() call), but stat() builds a text field, and then Postgres checks it and finds incomplete multibyte chars. The options I see (other than waiting for UTF support in tsearch2, which we are developing now):

1 Modify the stat() function to check the text field and, if the check fails, remove the lexeme from the output.

2 Take the word parser from CVS HEAD (ts_locale.[ch], wparser_def.c, wordparser/parser.[ch]). to_tsvector will work fine; to_tsquery will work correctly only with quoted strings (for example, 'foo' & 'bar'; bad: foo & bar). But casting 'asasas'::tsvector and dump/reload will not work correctly.
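
To illustrate the check that option 1 relies on, here is a rough stand-alone sketch in C. It is not the tsearch2 source and not a patch for stat(); inside the backend one would presumably call pg_verifymbstr() instead of hand-rolling the test. The names utf8_is_valid() and filter_valid_lexemes() are made up for this example, and it assumes the lexemes are NUL-terminated strings in a UTF-8 database.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the len bytes at s are well-formed UTF-8, 0 otherwise.
 * Hypothetical stand-in for the validation pg_verifymbstr() performs.
 */
int
utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = s[i];
        size_t need;            /* continuation bytes expected after c */
        size_t k;

        if (c < 0x80)
            need = 0;
        else if (c >= 0xC2 && c <= 0xDF)
            need = 1;
        else if (c >= 0xE0 && c <= 0xEF)
            need = 2;
        else if (c >= 0xF0 && c <= 0xF4)
            need = 3;
        else
            return 0;           /* stray continuation or invalid lead byte */

        if (len - i - 1 < need)
            return 0;           /* multibyte char cut off at end of lexeme */

        /* reject overlong forms, surrogates and values above U+10FFFF */
        if ((c == 0xE0 && s[i + 1] < 0xA0) ||
            (c == 0xED && s[i + 1] > 0x9F) ||
            (c == 0xF0 && s[i + 1] < 0x90) ||
            (c == 0xF4 && s[i + 1] > 0x8F))
            return 0;

        for (k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;       /* expected a 10xxxxxx continuation byte */

        i += need + 1;
    }
    return 1;
}

/*
 * Drop broken lexemes from the array, compacting it in place.
 * Returns the number of lexemes kept.
 */
size_t
filter_valid_lexemes(const char **lexemes, size_t n)
{
    size_t kept = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (utf8_is_valid((const unsigned char *) lexemes[i],
                          strlen(lexemes[i])))
            lexemes[kept++] = lexemes[i];
    return kept;
}
```

A lexeme such as "caf" followed by a lone 0xC3 byte (the first half of a two-byte char) is dropped, while complete words pass through unchanged. Silently skipping lexemes means the stat() output would be incomplete, so this only makes sense as a stopgap until the parser itself handles UTF-8.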





--
Teodor Sigaev                                   E-mail: [EMAIL PROTECTED]
                                                   WWW: http://www.sigaev.ru/

