I have a feeling that an issue I'm running into is related to this: http://archives.postgresql.org/pgsql-bugs/2008-06/msg00113.php

On Windows XP, running PgAdmin III 1.8.4 against either a PostgreSQL 8.3.0 or 8.3.3 database, I run:

select * from ts_debug('french', 'catalogue');

and get the following error:

ERROR:  invalid byte sequence for encoding "UTF8": 0xc3
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT:  SQL function "ts_debug" statement 1
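
For anyone unsure what the 0xc3 in that message refers to, here is a small Python sketch (not PostgreSQL code, and not a diagnosis of this particular failure). It only illustrates that 0xC3 is the lead byte of a two-byte UTF-8 sequence, as used for accented letters such as "é", and that it is rejected when its continuation byte is missing. The function name describe_utf8_failure is made up for illustration.

```python
# Illustration only: how the byte 0xC3 shows up in UTF-8 text.
# describe_utf8_failure is a hypothetical helper, not part of any library.

def describe_utf8_failure(data: bytes) -> str:
    """Return 'ok' if data is valid UTF-8, else describe the first bad byte."""
    try:
        data.decode("utf-8")
        return "ok"
    except UnicodeDecodeError as exc:
        return f"invalid byte 0x{data[exc.start]:02x} at offset {exc.start}"


# U+00E9 (e with acute accent) encodes to the two bytes C3 A9 in UTF-8.
print("\u00e9".encode("utf-8").hex())

# A complete two-byte sequence decodes cleanly.
print(describe_utf8_failure(b"caf\xc3\xa9"))

# A lead byte whose continuation byte has been cut off does not.
print(describe_utf8_failure(b"caf\xc3"))

# Neither does a lead byte followed by a plain ASCII character.
print(describe_utf8_failure(b"\xc3t\xa9"))
```

So a lone 0xc3 would be consistent with an accented character being split or truncated somewhere between the file and the parser, though that is only a guess at this point.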

I replaced the french.stop file with the one from the Snowball web site (http://snowball.tartarus.org/algorithms/french/stemmer.html) to see whether that would make any difference, but the error was the same. I also tried loading the French Hunspell dictionary from the OpenOffice web site (http://wiki.services.openoffice.org/wiki/Dictionaries), using the following command:

CREATE TEXT SEARCH DICTIONARY public.fr_ispell (
   TEMPLATE = pg_catalog.ispell,
   DictFile = fr_FR,
   AffFile = fr_FR,
   StopWords = french
);

This gives the same error. I have successfully loaded the English and Arabic dictionaries, plus an Arabic stop file I sourced from elsewhere, and they work fine with the various text search function calls, so the problem appears to be specific to a French character occurring in the stop file and the dictionaries. To use the French OO dictionaries, I had to convert them from ISO-8859-15 to UTF-8. Since the files converted on Windows gave the same result as the packaged stop file, I also downloaded them and converted the encoding on a Linux machine before copying them across to Windows, but that didn't help either.
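
To rule out a bad conversion, the re-encoding and a sanity check on the result can be sketched in Python as below. This is only a sketch under assumptions: the function names and the sample file names are hypothetical, and it is not the tool I actually used. It reads and writes raw bytes so that no newline translation or byte-order mark is introduced on Windows.

```python
import tempfile
from pathlib import Path

# All names below are hypothetical helpers for illustration.


def convert_latin9_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode an ISO-8859-15 text file as UTF-8 (no byte-order mark)."""
    text = src.read_bytes().decode("iso-8859-15")
    dst.write_bytes(text.encode("utf-8"))


def first_invalid_utf8_offset(path: Path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        path.read_bytes().decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return exc.start


def has_utf8_bom(path: Path) -> bool:
    """Report whether the file starts with the UTF-8 byte-order mark EF BB BF."""
    return path.read_bytes().startswith(b"\xef\xbb\xbf")


def demo() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample_latin9.txt"  # stand-in for a dictionary file
        dst = Path(tmp) / "sample_utf8.txt"
        # U+00E9 is the single byte E9 in ISO-8859-15.
        src.write_bytes("\u00e9t\u00e9\n".encode("iso-8859-15"))
        print("source valid UTF-8:", first_invalid_utf8_offset(src) is None)
        convert_latin9_to_utf8(src, dst)
        print("converted valid UTF-8:", first_invalid_utf8_offset(dst) is None)
        print("converted has BOM:", has_utf8_bom(dst))


if __name__ == "__main__":
    demo()
```

If a check like first_invalid_utf8_offset reports no offset for the converted .dict, .affix and .stop files, the files themselves are valid UTF-8 and the problem is more likely in how they are being read.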

However, if I run the same ts_debug('french', 'catalogue') query against PostgreSQL 8.3.1 on Linux, it works fine. I have not tried version 8.3.1 on Windows. While there are many more combinations to exhaust before I can make a categorical statement, at this stage it appears to point towards an issue with PostgreSQL's UTF-8 parsing on Windows.

Is this an outstanding defect, or am I doing something wrong in my environment? I have searched the Internet for anything related, but other than the thread linked at the top, I have found nothing, which surprises me given what I imagine to be the size of the French user base. Hence I'm thinking it may be something in my environment causing the issue. If others could reproduce the error on their XP machines, that would indicate that the issue is not specific to me.

At this stage, it is not that important to me, as I'm just playing around with text search for my own curiosity, and French was simply a language I picked at random, along with Arabic (for which I'm lacking a Snowball stemmer). I don't actually read, much less speak, those languages. However, it would still be nice to have them working.

One additional, related topic: for some languages, OO has thesaurus files that are not in the format supported by Pg full text search. Are there any plans to support the OO thesaurus file format? OO also has hyphenation files. Are there any plans to extend the current dictionary files to include hyphenation rules as captured in the OO hyphenation files? I'm not sure how, if at all, hyphenation rules would improve indexing and searching, but as the files exist, I thought I would pose the question.

Thanks,

Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)