On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott <deni...@gmail.com> wrote:
> Hi John:
>
> On Sun, Sep 30, 2012 at 11:45 PM, john knightley
> <john.knight...@gmail.com> wrote:
>> Dear Dan,
>>
>> thank you for your reply.
>>
>> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
>> a UTF-8 locale.
>>
>> A short 5-line dictionary file is sufficient to test:
>>
>> raeuz
>> 我们
>> 𦘭𥎵
>> 𪽖𫖂
>>
>>
>> line 1 "raeuz" is a Zhuang word written using English letters and
>> shows up under ts_vector ok
>> line 2 "我们" is an everyday Chinese word and shows up under
>> ts_vector ok
>> line 3 "𦘭𥎵" is a Zhuang word written using rather old Chinese
>> characters found in Unicode 3.1, which came in about the year 2000,
>> and shows up under ts_vector ok
>> line 4 "𪽖𫖂" is a Zhuang word written using rather old Chinese
>> characters found in Unicode 5.2, which came in about the year 2009,
>> but does not show up under ts_vector
>> line 5 "" is a Zhuang word written using rather old Chinese
>> characters found in the PUA area of the font Sawndip.ttf, but does
>> not show up under ts_vector (the font can be downloaded from
>> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>>
>> The last two words, even though included in a dictionary, do not get
>> accepted by ts_vector.
>
> Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here, the latter seems to
> work using the default text search configuration (albeit with one
> crucial note: I created the database with the "lc_ctype=C
> lc_collate=C" options):
>
> WORKING:
>
> createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
> foobar=# select ts_debug('');
>                             ts_debug
> ----------------------------------------------------------------
>  (word,"Word, all letters",,{english_stem},english_stem,{})
> (1 row)
>
> NOT WORKING AS EXPECTED:
>
>
> foobaz=# SHOW LC_CTYPE;
>   lc_ctype
> -------------
>  en_US.UTF-8
> (1 row)
>
> foobaz=# select ts_debug('');
>             ts_debug
> ---------------------------------
>  (blank,"Space symbols",,{},,)
> (1 row)
>
> So... perhaps LC_CTYPE=C is a possible workaround for you?

LC_CTYPE=C would not be a workaround: this database needs to be in
UTF-8, because the full text search is to be used for a MediaWiki
installation.

Is this a bug that is being worked on?

Regards
John
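
For anyone trying to reproduce this, it may help to confirm which Unicode range each test word falls in before blaming the dictionary. The following is only a rough Python sketch, not anything PostgreSQL does internally: the helper names `classify` and `describe` are made up for illustration, and the range table is limited to the blocks mentioned in this thread. It simply prints the code point and block of each character, so one can see whether the words that come back as "Space symbols" under en_US.UTF-8 are the ones from the later additions (Unicode 5.2 or the PUA).

```python
# Hypothetical helper, not part of PostgreSQL: reports which Unicode
# range each character of a dictionary word falls in.
# Only the blocks mentioned in this thread are listed.
RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xE000, 0xF8FF, "Private Use Area"),
    (0x20000, 0x2A6DF, "CJK Extension B (Unicode 3.1)"),
    (0x2A700, 0x2B73F, "CJK Extension C (Unicode 5.2)"),
]


def classify(ch):
    """Return the name of the listed range containing ch, or 'other'."""
    cp = ord(ch)
    for lo, hi, name in RANGES:
        if lo <= cp <= hi:
            return name
    return "other"


def describe(word):
    """Return a list of (code point, range name) pairs for a word."""
    return [(f"U+{ord(ch):04X}", classify(ch)) for ch in word]


if __name__ == "__main__":
    # The first four lines of the test dictionary from this thread.
    for word in ["raeuz", "我们", "𦘭𥎵", "𪽖𫖂"]:
        print(word, describe(word))
```

If the failing words all land in the newer ranges while the working ones do not, that would be consistent with the parser result depending on how the LC_CTYPE locale classifies those characters, though the sketch itself proves nothing about the server.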