Hi John:

On Sun, Sep 30, 2012 at 11:45 PM, john knightley <john.knight...@gmail.com> wrote:
> Dear Dan,
>
> thank you for your reply.
>
> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
> a utf8 locale.
>
> A short 5 line dictionary file is sufficient to test:-
>
> raeuz
> 我们
> 𦘭𥎵
> 𪽖𫖂
> 
>
> line 1 "raeuz" Zhuang word written using English letters and shows up
> under ts_vector ok
> line 2 "我们" uses everyday Chinese word and shows up under ts_vector ok
> line 3 "𦘭𥎵" Zhuang word written using rather old Chinese characters
> found in Unicode 3.1 which came in about the year 2000 and shows up
> under ts_vector ok
> line 4 "𪽖𫖂" Zhuang word written using rather old Chinese characters
> found in Unicode 5.2 which came in about the year 2009 but does not show
> up under ts_vector
> line 5 "" Zhuang word written using rather old Chinese characters
> found in PUA area of the font Sawndip.ttf but does not show up under
> ts_vector (Font can be downloaded from
> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>
> The last two words even though included in a dictionary do not get
> accepted by ts_vector.

Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here, the latter seems to work
using the default text search configuration (albeit with one crucial
note: I created the database with the "lc_ctype=C lc_collate=C" options):

WORKING:

  createdb --template=template0 --lc-ctype=C --lc-collate=C foobar

  foobar=# select ts_debug('');
                              ts_debug
  ----------------------------------------------------------------
   (word,"Word, all letters",,{english_stem},english_stem,{})
  (1 row)

NOT WORKING AS EXPECTED:

  foobaz=# SHOW LC_CTYPE;
    lc_ctype
  -------------
   en_US.UTF-8
  (1 row)

  foobaz=# select ts_debug('');
              ts_debug
  ---------------------------------
   (blank,"Space symbols",,{},,)
  (1 row)

So... perhaps LC_CTYPE=C is a possible workaround for you?
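
In case it helps narrow things down, here is a rough Python sketch for checking which Unicode block each character of a test word falls in. It only reports code points and blocks; it does not talk to PostgreSQL. The helper names (block_of, describe) are made up for illustration, and the block table is a small subset chosen for the words in this thread. That the en_US.UTF-8 behaviour is tied to how recently a character was added to Unicode is my assumption, not something confirmed here.

```python
# Hypothetical helpers: report code point and Unicode block per character.
# The table is a small subset of Unicode blocks, enough for the test words.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xE000, 0xF8FF, "Private Use Area"),
    (0x20000, 0x2A6DF, "CJK Extension B"),  # added in Unicode 3.1
    (0x2A700, 0x2B73F, "CJK Extension C"),  # added in Unicode 5.2
]


def block_of(ch):
    """Return the name of the block containing ch, or "other"."""
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "other"


def describe(word):
    """Return a list of (code point, block name) pairs for word."""
    return [(f"U+{ord(ch):04X}", block_of(ch)) for ch in word]


if __name__ == "__main__":
    # Test words taken from the dictionary file quoted above.
    for word in ["raeuz", "我们", "𦘭𥎵", "𪽖𫖂"]:
        print(word, describe(word))
```

If the words that fail all land in Extension C or the Private Use Area, that would at least be consistent with the locale's character tables being the difference between the two databases.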