Markus,
it would be nice if you (or somebody) could write a note about Unicode, so it
could be added to the tsearch2 documentation. It would help people and save
time and hair :)
Oleg
On Mon, 22 Nov 2004, Markus Wollny wrote:
Hi!
I dug through my list archives; I actually used to have the very same problem
that you describe: special characters being swallowed by the tsearch2 functions.
The source of the problem was that I had initdb'ed my cluster with
[EMAIL PROTECTED] as the locale, whereas my databases used Unicode encoding.
This combination does not work correctly. I had to dump, initdb with the correct
UTF-8 locale (de_DE.UTF-8 in my case), and reload to get tsearch2 to work
correctly. You may find the
original discussion here:
http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php
If you wish to find out which locale was used during initdb for your cluster,
you can use the pg_controldata program that ships with PostgreSQL.
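The dump / initdb / reload cycle described above might look roughly like the
following. This is only a sketch: the data directory path, the dump file name,
and the de_DE.UTF-8 locale are assumptions; substitute your own, and check the
LC_COLLATE/LC_CTYPE lines in the pg_controldata output first.

```
# Check which locale the cluster was initdb'ed with
# (look at the LC_COLLATE and LC_CTYPE lines)
pg_controldata /var/lib/pgsql/data

# 1. Dump everything while the old cluster is still running
pg_dumpall > /tmp/full_dump.sql

# 2. Stop the server, move the old cluster aside, and re-initdb
#    with a locale whose charset matches the database encoding
pg_ctl -D /var/lib/pgsql/data stop
mv /var/lib/pgsql/data /var/lib/pgsql/data.old
initdb --locale=de_DE.UTF-8 -D /var/lib/pgsql/data

# 3. Start the server and reload the dump
pg_ctl -D /var/lib/pgsql/data start
psql -f /tmp/full_dump.sql template1
```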
Kind regards
Markus
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
Dawid Kuroczko
Sent: Wednesday, 17 November 2004 17:17
To: Pgsql General
Subject: [GENERAL] Tsearch2 and Unicode?
I'm trying to use tsearch2 with a database in 'UNICODE' encoding.
It works fine for English text, but as I intend to search
Polish texts I did:

INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
    VALUES ('default_polish', 'default', 'pl_PL.UTF-8');

(and I updated the other pg_ts_* tables as described in the manual).
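Updating the other pg_ts_* tables typically means cloning the default
configuration's token-to-dictionary mappings for the new configuration. A
hedged sketch via psql follows; the pg_ts_cfgmap column names are taken from
the tsearch2 manual of that era and 'mydb' is a placeholder, so verify both
against your installation before running anything.

```
psql mydb <<'SQL'
-- Clone the default configuration's token->dictionary mappings
-- for the new 'default_polish' configuration
INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
  SELECT 'default_polish', tok_alias, dict_name
    FROM pg_ts_cfgmap
   WHERE ts_name = 'default';
SQL
```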
However, the Polish-specific characters are being eaten alive, it seems.
That is, select to_tsvector('default_polish', body) from
messages; returns a list of words, but with the national characters stripped...
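This symptom is consistent with a byte-oriented, C/Latin-locale notion of
"letter" being applied to UTF-8 data: each Polish diacritic is a multibyte
UTF-8 sequence, and none of its bytes counts as a letter under the wrong
locale. A rough way to reproduce the effect outside the database (the tr
pipeline is an illustration of the mechanism, not tsearch2's actual parser):

```shell
# Under the C locale only ASCII bytes count as letters, so the multibyte
# UTF-8 sequences for ż, ó and ł are treated as word separators
printf 'żółw idzie\n' | LC_ALL=C tr -cs 'A-Za-z' '\n'
# only "w" and "idzie" survive; the diacritics are stripped
```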
I wonder: am I doing something wrong, or does tsearch2
simply not grok Unicode, despite the locale setting? The same
question applies to ispell_dict and its feelings
about Unicode, but that's another story.
Assuming Unicode is unsupported means I should perhaps... oh,
convert the data to ISO 8859 prior to feeding it to to_tsvector()...
an interesting idea, but so far I have failed to actually do
it. Maybe store the data as 'bytea' and add a column with
encoding information (assuming I don't want to recreate the whole
database with a new encoding, and that I want to use Unicode
for some columns, so I don't have to keep the encoding with every
text everywhere...).
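The conversion itself can be done with iconv; ISO-8859-2 (Latin-2) is the
single-byte encoding that covers the Polish alphabet. A minimal sketch (the
choice of ISO-8859-2 is an assumption about the target locale; od is only
used here to make the resulting single-byte representation visible):

```shell
# Recode UTF-8 Polish text to ISO-8859-2 (Latin-2), which represents
# each Polish character in one byte; od shows the resulting bytes
printf 'żółw' | iconv -f UTF-8 -t ISO-8859-2 | od -An -tx1
```

Note that this only helps if the cluster was also initdb'ed with a matching
pl_PL ISO-8859-2 locale; recoding the text while the cluster locale still
mismatches would leave the original problem in place.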
And while we are at it, how do you feel about an extra column
with the tsvector and its index: would it be OK to keep them away
from my data (so I can safely get rid of them if need be)?
[I intend to keep an index of around 2,000,000 records, a few KB
of text each]...
Regards,
Dawid Kuroczko
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
http://www.postgresql.org/docs/faqs/FAQ.html
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83