Zsolt Branyiczky wrote:
Hi,

I installed mysql-essential-5.0.18-win32.msi (under Win XP) in order to use it with Bugzilla. Since I had some problem with importing old Bugzilla dumps containing some special hungarian vowels (they are outside of latin1 characterset), I checked how mysql stores these characters (vowels) inside.

I stored the following string (entering a new bug in Bugzilla) into database (hopefully you can read it):
áéíóöőúüűÁÉÍÓÖŐÚÜŰ
These 9-9 vowels are used in the hungarian alpabet (small and capital letters).

To my great surprise in mysql its hexadump is (select hex(column)):
C383C2A1C383C2A9C383C2ADC383C2B3C383C2B6C385E28098C383C2BAC383C2BCC385C2B1C383C281C383E280B0C383C28DC383E2809CC383E28093C385C290C383C5A1C383C593C385C2B0
It is longer (152 characters) that I expected (18x2x2 = 72 characters).

Here I cut the string into as many lines as vowels, moreover I placed into brackets the UTF8 codes that should be used normally instead of them.

C383C2A1 (C3A1) LATIN SMALL LETTER A WITH ACUTE
C383C2A9 (C3A9) LATIN SMALL LETTER E WITH ACUTE
C383C2AD (C3AD) LATIN SMALL LETTER I WITH ACUTE
C383C2B3 (C3B3) LATIN SMALL LETTER O WITH ACUTE
C383C2B6 (C3B6) LATIN SMALL LETTER O WITH DIAERESIS
C385E28098 (C591) LATIN SMALL LETTER O WITH DOUBLE ACUTE
C383C2BA (C3BA) LATIN SMALL LETTER U WITH ACUTE
C383C2BC (C3BC) LATIN SMALL LETTER U WITH DIAERESIS
C385C2B1 (C5B1) LATIN SMALL LETTER U WITH DOUBLE ACUTE
C383C281 (C381) LATIN CAPITAL LETTER A WITH ACUTE
C383E280B0 (C389) LATIN CAPITAL LETTER E WITH ACUTE
C383C28D (C38D) LATIN CAPITAL LETTER I WITH ACUTE
C383E2809C (C393) LATIN CAPITAL LETTER O WITH ACUTE
C383E28093 (C396) LATIN CAPITAL LETTER O WITH DIAERESIS
C385C290 (C590) LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
C383C5A1 (C39A) LATIN CAPITAL LETTER U WITH ACUTE
C383C593 (C39C) LATIN CAPITAL LETTER U WITH DIAERESIS
C385C2B0 (C5B0) LATIN CAPITAL LETTER U WITH DOUBLE ACUTE

The letters should be consumed just 2 bytes, but in mysql all of them uses 4 or 5(!!!) bytes. Does MySql 5.0 store the UTF8 codes not directly? Since Bugzilla displays the entered bug correctly (it reads the vowels properly from database), I think there must be some rule in mysql storing utf8 codes. When I make a (mysql)dump from the database, also these long codes are inside of the dump.

The old dumps were made still with mysql-essential-4.1.12-win32.msi and they contain proper UTF8 codes (with 2 bytes length per hungarian special vowel), so I do not know why MySQL 5.0 stores these vowels differently than MySQL 4.1.

Regards
Zsolt

I think I know the cause of the problem. Bugzilla somehow encodes (and decodes) twice the characters.
E.g. E1 (LATIN SMALL LETTER A WITH ACUTE) -> C3 A1 -> C3 83 C2 A1
So not MySQL is repsonsible for the strange phenomenon, I'm sorry for my previous post.

Zsolt

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to