Re: [GENERAL] Problem with langage encoding

2008-12-01 Thread Craig Ringer
Stephane Bortzmeyer wrote:
 On Thu, Nov 27, 2008 at 02:34:17AM +0900,
  Craig Ringer [EMAIL PROTECTED] wrote 
  a message of 188 lines which said:
 
 ERROR:  character 0xc3bd of encoding UTF8 has no equivalent in WIN1251
 Which it does not; that character is ??? (HANGUL SYLLABLE SSYEG) 
 
 No, I don't think so. I think that 0xc3bd is the binary value, and
 0xc3bd in UTF-8 is U+00FD (LATIN SMALL LETTER Y WITH ACUTE) which is
 less surprising. It is in Latin-1 but not in WIN1251.

Yes, that would make a lot more sense. From there it's easy, see below:

 In KOI8-R, FD is the Shcha, a common letter in Russian, so I wonder if
 the database was really in Latin-1.

OK, so what we end up with is the following sequence (expressed in
Python, which is always handy for encoding conversions etc):

print \xc3\xbd.decode(utf-8).encode(latin-1).decode(koi8_r)

to yield: Щ

Assuming that's right, what presumably happened is that the database was
initialized with UTF-8 and loaded with data it was told was in the
latin-1 encoding, but was actually in the  koi8_r encoding. Therefore,
utf-8 sequences were generated based on the interpretation of the bytes
for each koi8_r character as the latin-1 character for the same byte
value, so:

Input file: 0xfd (latin-1: ý, koi8_r: Щ)
   |
   | (input interpreted as latin-1)
   v
Database:   0xc3bd (utf-8: ý)

To recover the data you must reverse that process. Thankfully it's going
to be 100% reversible, ie no information has been lost.

To create a tiny test table for the following explanation and fix code I
just ran:

create table ss ( x text) ;
insert into ss (x) values (E'\xc3\xbd');

Now if I:

set client_encoding = WIN1251;

I get:

test= select * from ss;
ERROR:  character 0xc3bd of encoding UTF8 has no equivalent in WIN1251

just like you.

With client_encoding set as utf-8 (which is what my machine is) I get:

test= select * from ss;
 x
---
 ý
(1 row)



PostgreSQL's convert() function:
http://www.postgresql.org/docs/current/static/functions-string.html

may be used now to transform your data. It doesn't assume any encoding
for the input string, unlike convert_from and convert_to, so you can use
a statement like this to convert your data:(where 'x' is the string of
mangled data to be converted):

select (
convert_from(
  convert(x::bytea, 'utf-8', 'latin-1'),
  'koi8_r')
) from ss;

In other words: Decode the utf-8 sequence in the input and map each
utf-8 code point to the corresponding latin-1 character, outputting one
byte per latin-1 character. Interpret the sequence of bytes just
produced as a sequence of characters in the koi8_r encoding, and map
them to the same characters in the database's internal encoding.

To copy the converted data to a new table:

CREATE TABLE converted (y text);
INSERT INTO converted(y)
SELECT (
convert_from(
  convert(x::bytea, 'utf-8', 'latin-1'),
  'koi8_r')
) from ss;

Now if I SELECT from the table of converted data, I get the right(?) output:

test= select * from converted;
 y
---
 Щ
(1 row)

You can easily wrap that up in a simple SQL function:

CREATE OR REPLACE FUNCTION fixstr(text) RETURNS text AS $$
SELECT convert_from(convert($1::bytea, 'utf-8', 'latin-1'),'koi8_r')
$$ LANGUAGE 'SQL' IMMUTABLE;

so you can just:

test= select fixstr(x) from ss;
 fixstr

 Щ
(1 row)

--
Craig Ringer

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Problem with langage encoding

2008-11-28 Thread Stephane Bortzmeyer
On Thu, Nov 27, 2008 at 02:34:17AM +0900,
 Craig Ringer [EMAIL PROTECTED] wrote 
 a message of 188 lines which said:

  ERROR:  character 0xc3bd of encoding UTF8 has no equivalent in WIN1251
 
 Which it does not; that character is ??? (HANGUL SYLLABLE SSYEG) 

No, I don't think so. I think that 0xc3bd is the binary value, and
0xc3bd in UTF-8 is U+00FD (LATIN SMALL LETTER Y WITH ACUTE) which is
less surprising. It is in Latin-1 but not in WIN1251.

In KOI8-R, FD is the Shcha, a common letter in Russian, so I wonder if
the database was really in Latin-1.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Problem with langage encoding

2008-11-26 Thread Ronald Vyhmeister
I've inherited a database system (including source)...  It's Russian text
(which I don't speak, but I need to work on it).  

On my screen it shows correctly, but in the database it shows a mess...
from trying with a web browser.  It appears that the data in the database is
encoded with ISO-8859-1 (my browser and the database viewed with PGAdmin
match when my browser has ISO-8859-1 encoding)  and it shows correctly
on screen in my browser when I use the WIN1251 encoding...

The problem is that the database is UTF8, and it won't let me use the
convert_to function because it says that the characters don't exist... What
I need to do is to pull the UTF8 from the database, tell postgres that it's
8859-1, and then convert it to WIN1251... How?  

Thank you for any assistance!

Ron




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Problem with langage encoding

2008-11-26 Thread Raymond O'Donnell
On 26/11/2008 14:40, Ronald Vyhmeister wrote:

 The problem is that the database is UTF8, and it won't let me use the
 convert_to function because it says that the characters don't exist... What
 I need to do is to pull the UTF8 from the database, tell postgres that it's
 8859-1, and then convert it to WIN1251... How?  

If the DB is in UTF8, you ought to be able to issue

  set client_encoding to 'WIN1251'

after connection, and the conversion happens automatically. See:

  http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483

HTH,

Ray.


--
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
[EMAIL PROTECTED]
Galway Cathedral Recitals: http://www.galwaycathedral.org/recitals
--

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Problem with langage encoding

2008-11-26 Thread Ronald Vyhmeister


 The problem is that the database is UTF8, and it won't let me use the
 convert_to function because it says that the characters don't exist...
What
 I need to do is to pull the UTF8 from the database, tell postgres that
it's
 8859-1, and then convert it to WIN1251... How?  

If the DB is in UTF8, you ought to be able to issue

  set client_encoding to 'WIN1251'

after connection, and the conversion happens automatically. See:

  http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483

HTH,

Ray.

Wish it would work...  when I do it, I get:

ERROR:  character 0xc3bd of encoding UTF8 has no equivalent in WIN1251
** Error **
ERROR: character 0xc3bd of encoding UTF8 has no equivalent in WIN1251
SQL state: 22P05

The DB is storing it UTF8, but it is really 8859-1...  I need to force PG to
think that it's 8859-1 in spite of what it thinks it may be...

Ron


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Problem with langage encoding

2008-11-26 Thread Tom Lane
Ronald Vyhmeister [EMAIL PROTECTED] writes:
 The DB is storing it UTF8, but it is really 8859-1...  I need to force PG to
 think that it's 8859-1 in spite of what it thinks it may be...

Dump the database, change the set client_encoding command in the
resulting file, reload.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Problem with langage encoding

2008-11-26 Thread Craig Ringer
Ronald Vyhmeister wrote:
 
 The problem is that the database is UTF8, and it won't let me use the
 convert_to function because it says that the characters don't exist...
 What
 I need to do is to pull the UTF8 from the database, tell postgres that
 it's
 8859-1, and then convert it to WIN1251... How?  
 
 If the DB is in UTF8, you ought to be able to issue
 
  set client_encoding to 'WIN1251'
 
 after connection, and the conversion happens automatically. See:
 
  http://www.postgresql.org/docs/8.3/static/multibyte.html#AEN27483
 
 HTH,
 
 Ray.
 
 Wish it would work...  when I do it, I get:
 
 ERROR:  character 0xc3bd of encoding UTF8 has no equivalent in WIN1251

Which it does not; that character is 쎽 (HANGUL SYLLABLE SSYEG) which
certainly isn't in WIN1251 or in latin-1 (ISO-8859-1). The byte sequence
for this character in UTF-8 is:

0xec 0x8e 0xbd

When decoded as latin-1, those three bytes are interpreted as: 쎽 ...
which doesn't seem much more likely to be valid. Are you SURE it's
latin-1, not (say) Koi_r-8?



Python is a useful tool for investigating encoding problems. It draws a
strong distinction between true Unicode strings and byte strings. It
provides quality conversion routines that let you specify the encoding
of the byte string, and that'll throw an exception for nonsensical input.

This lets you take a byte string, decode it according to various
encodings, and see what you land up with. You can also test various
horrible encoding mangling schemes used by apps rather easily. The
python `encodings' module contains a list of all known encodings by all
aliases as `encodings.aliases.aliases'. Here's some code that takes an
input byte string and prints a table of all ways it can be interpreted
according to various encodings:

---
import sys
import encodings

stupid_encodings = ['zlib_codec', 'uu_codec', 'hex_codec', 'bz2_codec']

bs = '\xc3\xbd'
for encoding in set(encodings.aliases.aliases.values()):
  sys.stdout.write(u%20s:  % encoding)
  sys.stdout.flush()
  if encoding in stupid_encodings:
continue
  try:
print u\%6s\ (%6s) % (bs.decode(encoding),
repr(bs.decode(encoding)))
  except UnicodeDecodeError,e:
print u [INVALID]
  except LookupError,e:
print u [UNSUPPORTED]




... and here's the output for the data you mentioned:



   bz2_codec:   cp1140: C¨ (u'C\xa8')
  rot_13: ý (u'\xc3\xbd')
   cp932: テス (u'\uff83\uff7d')
euc_jisx0213:  箪 (u'\u7baa')
   cp037: C¨ (u'C\xa8')
   hex_codec:cp500: C¨ (u'C\xa8')
uu_codec:big5hkscs:  羸 (u'\u7fb8')
mbcs:  [UNSUPPORTED]
euc_jis_2004:  箪 (u'\u7baa')
iso2022_jp_3:  [INVALID]
iso2022_jp_2:  [INVALID]
iso2022_jp_1:  [INVALID]
 gbk:  媒 (u'\u5a92')
 iso2022_jp_2004:  [INVALID]
quopri_codec:  [INVALID]
   cp424: C¨ (u'C\xa8')
  iso2022_jp:  [INVALID]
 mac_iceland: √Ω (u'\u221a\u03a9')
   hp_roman8: û§ (u'\xfb\xa7')
  iso2022_kr:  [INVALID]
  euc_kr:  첵 (u'\uccb5')
  cp1254: ý (u'\xc3\xbd')
  gb2312:  媒 (u'\u5a92')
   cp850: Û (u'\u251c\xa2')
   shift_jis: テス (u'\uff83\uff7d')
   cp852: ├Ż (u'\u251c\u017b')
   cp855: ├й (u'\u251c\u0439')
   utf_16_le:  뷃 (u'\ubdc3')
   cp857: Û (u'\u251c\xa2')
   cp775: ├Į (u'\u251c\u012e')
  cp1026: C¨ (u'C\xa8')
  mac_latin2: √Ĺ (u'\u221a\u0139')
mac_cyrillic: √љ (u'\u221a\u0459')
base64_codec:('')
 ptcp154: ГҪ (u'\u0413\u04aa')
  euc_jp:  箪 (u'\u7baa')
  hz:  [INVALID]
   utf_8:  ý (u'\xfd')
   mac_greek: ΟΫ (u'\u039f\u03ab')
   utf_7:  [INVALID]
 mac_turkish: √Ω (u'\u221a\u03a9')
   cp949:  첵 (u'\uccb5')
  zlib_codec: big5:  羸 (u'\u7fb8')
   iso8859_9: ý (u'\xc3\xbd')
   iso8859_8:  [INVALID]
   iso8859_5: УН (u'\u0423\u041d')
   iso8859_4: ÃŊ (u'\xc3\u014a')
   iso8859_7: Γ½ (u'\u0393\xbd')
   iso8859_6:  [INVALID]
   iso8859_3:  [INVALID]
   iso8859_2: Ă˝ (u'\u0102\u02dd')
 gb18030:  媒 (u'\u5a92')
  shift_jis_2004: テス (u'\uff83\uff7d')
   mac_roman: √Ω (u'\u221a\u03a9')
   cp950:  羸 (u'\u7fb8')
  utf_16:  뷃 (u'\ubdc3')
  iso8859_15: Ü (u'\xc3\u0153')
  iso8859_14: ÃẄ (u'\xc3\u1e84')
 tis_620: รฝ