[GENERAL] Character encoding problems

Bruce Clay Thu, 08 Dec 2011 23:36:42 -0800

Sorry for the duplicate postings.  I have only received one reply so far, and 
that was a suggestion to post to this forum.


I am trying to build a database to support natural language processing from a 
variety of data files posted on the internet.  Many of them are identified as 
using UTF-8 encoding.  Some of these are dictionary files from WinEdt.  Some are 
from an open-source multilingual health care package.

When I try to build a table from several of the different languages, I get the 
following error:

ERROR: invalid byte sequence for encoding "UTF8": 0x82

I checked the encoding and it is indeed set to UTF-8.  I tried to create 
databases using a variety of other encoding types, such as WIN1252 and others, 
and I got the same error message from all of them except SQL_ASCII.
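
One way to narrow this down is to inspect the raw bytes outside the database.
The byte 0x82 is a UTF-8 continuation byte, so it can never begin a valid UTF-8
sequence; a file containing it on its own is probably in some single-byte code
page, whatever its label says.  The following is a minimal Python sketch, not a
definitive diagnosis: the helper name first_invalid_utf8 and the sample bytes
are hypothetical, and the list of candidate code pages is only a guess at what
such word lists might use.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding, or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


# Hypothetical sample: "caf" followed by a lone 0x82 byte.
sample = b"caf\x82"
print(first_invalid_utf8(sample))

# Show what 0x82 would mean under a few candidate single-byte encodings
# (the candidates are an assumption, not taken from the data files).
for codec in ("cp437", "cp850", "cp1252", "latin-1"):
    print(codec, repr(bytes([0x82]).decode(codec)))
```

If one of the candidates turns 0x82 into a character that makes sense in the
surrounding word, that code page is a plausible guess for the file's real
encoding.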

When I created the database using SQL_ASCII, I received the warning that the 
database could only store 7-bit data.  When I loaded the data into this 
database I did not get any errors, and when I look at the data it seems to be 
the same as in the original text file.

Is there a "proper" encoding type that I should use to load the word lists so 
they can be interoperable with the WordNet dataset that happily uses the UTF8 
encoding?
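
If the word lists turn out to be in a single-byte code page rather than UTF-8,
one possible approach is to transcode each file to UTF-8 before loading it into
a UTF8 database.  The sketch below assumes the real source encoding has already
been worked out for each file; the function name transcode_to_utf8 and its
parameters are hypothetical.

```python
def transcode_to_utf8(src_path, dst_path, source_encoding):
    """Rewrite the text file at src_path as UTF-8 at dst_path.

    source_encoding is the encoding the file is actually in (an
    assumption the caller must supply, e.g. "cp850").  Decoding is
    strict, so a wrong guess that yields undecodable bytes raises
    UnicodeDecodeError instead of silently corrupting data.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=source_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

A strict decode only catches guesses that produce undecodable bytes.  Most
single-byte code pages accept nearly any byte, so it is still worth checking a
few converted words by eye.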

Bruce
