Markus,
it would be nice if you (or somebody) could write a note about Unicode, so it
could be added to the tsearch2 documentation. It would help people and save
time and hair :)
Oleg
On Mon, 22 Nov 2004, Markus Wollny wrote:
Hi!
I dug through my list archives; I actually used to have the very same problem
that you describe: special characters being swallowed by the tsearch2 functions.
The source of the problem was that I had initdb'ed my cluster with
[EMAIL PROTECTED] as the locale, whereas my databases used Unicode encoding.
This combination does not work correctly. I had to dump, initdb with the correct
UTF-8 locale (de_DE.UTF-8 in my case), and reload to get tsearch2 to work
correctly. You may find the
original discussion here:
http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php
If you wish to find out which locale was used during initdb for your cluster,
you can use the pg_controldata program that ships with PostgreSQL.
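The dump / initdb / reload cycle described above might look roughly like the
following. This is only a sketch: the data directory path, the dump file name,
and the de_DE.UTF-8 locale are assumptions; substitute your own, and check the
LC_COLLATE/LC_CTYPE lines in the pg_controldata output first.

```
# Check which locale the cluster was initdb'ed with
# (look at the LC_COLLATE and LC_CTYPE lines)
pg_controldata /var/lib/pgsql/data

# 1. Dump everything while the old cluster is still running
pg_dumpall > /tmp/full_dump.sql

# 2. Stop the server, move the old cluster aside, and re-initdb
#    with a locale whose charset matches the database encoding
pg_ctl -D /var/lib/pgsql/data stop
mv /var/lib/pgsql/data /var/lib/pgsql/data.old
initdb --locale=de_DE.UTF-8 -D /var/lib/pgsql/data

# 3. Start the server and reload the dump
pg_ctl -D /var/lib/pgsql/data start
psql -f /tmp/full_dump.sql template1
```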
Kind regards
Markus
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
Dawid Kuroczko
Sent: Wednesday, 17 November 2004 17:17
To: Pgsql General
Subject: [GENERAL] Tsearch2 and Unicode?
I'm trying to use tsearch2 with a database in 'UNICODE' encoding.
It works fine for English text, but as I intend to search
Polish texts I did:

INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
    VALUES ('default_polish', 'default', 'pl_PL.UTF-8');

(and I updated the other pg_ts_* tables as described in the manual).
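Updating the other pg_ts_* tables typically means cloning the default
configuration's token-to-dictionary mappings for the new configuration. A
hedged sketch via psql follows; the pg_ts_cfgmap column names are taken from
the tsearch2 manual of that era and 'mydb' is a placeholder, so verify both
against your installation before running anything.

```
psql mydb <<'SQL'
-- Clone the default configuration's token->dictionary mappings
-- for the new 'default_polish' configuration
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
  SELECT 'default_polish', tok_alias, dict_name
    FROM pg_ts_cfgmap
   WHERE ts_name = 'default';
SQL
```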
However, the Polish-specific characters are being eaten alive, it seems.
That is, select to_tsvector('default_polish', body) from
messages; returns a list of words, but with the national characters stripped...
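This symptom is consistent with a byte-oriented, C/Latin-locale notion of
"letter" being applied to UTF-8 data: each Polish diacritic is a multibyte
UTF-8 sequence, and none of its bytes counts as a letter under the wrong
locale. A rough way to reproduce the effect outside the database (the tr
pipeline is an illustration of the mechanism, not tsearch2's actual parser):

```shell
# Under the C locale only ASCII bytes count as letters, so the multibyte
# UTF-8 sequences for ż, ó and ł are treated as word separators
printf 'żółw idzie\n' | LC_ALL=C tr -cs 'A-Za-z' '\n'
# only "w" and "idzie" survive; the diacritics are stripped
```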
I wonder: am I doing something wrong, or does tsearch2
simply not grok Unicode, despite the locale setting? The same
question applies to ispell_dict and its feelings
about Unicode, but that's another story.
Assuming Unicode is unsupported means I should perhaps... oh,
convert the data to ISO 8859 prior to feeding it to to_tsvector()...
an interesting idea, but so far I have failed to actually do
it. Maybe store the data as 'bytea' and add a column with
encoding information (assuming I don't want to recreate the whole
database with a new encoding, and that I want to use Unicode
for some columns, so I don't have to keep the encoding with every
text everywhere...).
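The conversion itself can be done with iconv; ISO-8859-2 (Latin-2) is the
single-byte encoding that covers the Polish alphabet. A minimal sketch (the
choice of ISO-8859-2 is an assumption about the target locale; od is only
used here to make the resulting single-byte representation visible):

```shell
# Recode UTF-8 Polish text to ISO-8859-2 (Latin-2), which represents
# each Polish character in one byte; od shows the resulting bytes
printf 'żółw' | iconv -f UTF-8 -t ISO-8859-2 | od -An -tx1
```

Note that this only helps if the cluster was also initdb'ed with a matching
pl_PL ISO-8859-2 locale; recoding the text while the cluster locale still
mismatches would leave the original problem in place.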
And while we are at it, how do you feel about an extra column
with the tsvector and its index: would it be OK to keep them away
from my data (so I can safely get rid of them if need be)?
[I intend to keep an index of around 2,000,000 records, a few KB
of text each]...
Regards,
Dawid Kuroczko
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faqs/FAQ.html
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83