mysql 5.0 and utf8 letters inside

2006-03-03 Thread Zsolt Branyiczky

Hi,

I installed mysql-essential-5.0.18-win32.msi (under Win XP) in order to 
use it with Bugzilla. Since I had some problem with importing old 
Bugzilla dumps containing some special hungarian vowels (they are 
outside of latin1 characterset), I checked how mysql stores these 
characters (vowels) inside.


I stored the following string (entering a new bug in Bugzilla) into 
database (hopefully you can read it):

áéíóöőúüűÁÉÍÓÖŐÚÜŰ
These 9-9 vowels are used in the hungarian alpabet (small and capital 
letters).


To my great surprise in mysql its hexadump is (select hex(column)):
C383C2A1C383C2A9C383C2ADC383C2B3C383C2B6C385E28098C383C2BAC383C2BCC385C2B1C383C281C383E280B0C383C28DC383E2809CC383E28093C385C290C383C5A1C383C593C385C2B0
It is longer (152 characters) that I expected (18x2x2 = 72 characters).

Here I cut the string into as many lines as vowels, moreover I placed 
into brackets the UTF8 codes that should be used normally instead of them.


C383C2A1 (C3A1) LATIN SMALL LETTER A WITH ACUTE
C383C2A9 (C3A9) LATIN SMALL LETTER E WITH ACUTE
C383C2AD (C3AD) LATIN SMALL LETTER I WITH ACUTE
C383C2B3 (C3B3) LATIN SMALL LETTER O WITH ACUTE
C383C2B6 (C3B6) LATIN SMALL LETTER O WITH DIAERESIS
C385E28098 (C591) LATIN SMALL LETTER O WITH DOUBLE ACUTE
C383C2BA (C3BA) LATIN SMALL LETTER U WITH ACUTE
C383C2BC (C3BC) LATIN SMALL LETTER U WITH DIAERESIS
C385C2B1 (C5B1) LATIN SMALL LETTER U WITH DOUBLE ACUTE
C383C281 (C381) LATIN CAPITAL LETTER A WITH ACUTE
C383E280B0 (C389) LATIN CAPITAL LETTER E WITH ACUTE
C383C28D (C38D) LATIN CAPITAL LETTER I WITH ACUTE
C383E2809C (C393) LATIN CAPITAL LETTER O WITH ACUTE
C383E28093 (C396) LATIN CAPITAL LETTER O WITH DIAERESIS
C385C290 (C590) LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
C383C5A1 (C39A) LATIN CAPITAL LETTER U WITH ACUTE
C383C593 (C39C) LATIN CAPITAL LETTER U WITH DIAERESIS
C385C2B0 (C5B0) LATIN CAPITAL LETTER U WITH DOUBLE ACUTE

The letters should be consumed just 2 bytes, but in mysql all of them 
uses 4 or 5(!!!) bytes. Does MySql 5.0 store the UTF8 codes not 
directly? Since Bugzilla displays the entered bug correctly (it reads 
the vowels properly from database), I think there must be some rule in 
mysql storing utf8 codes. When I make a (mysql)dump from the database, 
also these long codes are inside of the dump.


The old dumps were made still with mysql-essential-4.1.12-win32.msi and 
they contain proper UTF8 codes (with 2 bytes length per hungarian 
special vowel), so I do not know why MySQL 5.0 stores these vowels 
differently than MySQL 4.1.


Regards
Zsolt

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Re: mysql 5.0 and utf8 letters inside

2006-03-03 Thread Zsolt Branyiczky

Zsolt Branyiczky wrote:

Hi,

I installed mysql-essential-5.0.18-win32.msi (under Win XP) in order to 
use it with Bugzilla. Since I had some problem with importing old 
Bugzilla dumps containing some special hungarian vowels (they are 
outside of latin1 characterset), I checked how mysql stores these 
characters (vowels) inside.


I stored the following string (entering a new bug in Bugzilla) into 
database (hopefully you can read it):

áéíóöőúüűÁÉÍÓÖŐÚÜŰ
These 9-9 vowels are used in the hungarian alpabet (small and capital 
letters).


To my great surprise in mysql its hexadump is (select hex(column)):
C383C2A1C383C2A9C383C2ADC383C2B3C383C2B6C385E28098C383C2BAC383C2BCC385C2B1C383C281C383E280B0C383C28DC383E2809CC383E28093C385C290C383C5A1C383C593C385C2B0 


It is longer (152 characters) that I expected (18x2x2 = 72 characters).

Here I cut the string into as many lines as vowels, moreover I placed 
into brackets the UTF8 codes that should be used normally instead of them.


C383C2A1 (C3A1) LATIN SMALL LETTER A WITH ACUTE
C383C2A9 (C3A9) LATIN SMALL LETTER E WITH ACUTE
C383C2AD (C3AD) LATIN SMALL LETTER I WITH ACUTE
C383C2B3 (C3B3) LATIN SMALL LETTER O WITH ACUTE
C383C2B6 (C3B6) LATIN SMALL LETTER O WITH DIAERESIS
C385E28098 (C591) LATIN SMALL LETTER O WITH DOUBLE ACUTE
C383C2BA (C3BA) LATIN SMALL LETTER U WITH ACUTE
C383C2BC (C3BC) LATIN SMALL LETTER U WITH DIAERESIS
C385C2B1 (C5B1) LATIN SMALL LETTER U WITH DOUBLE ACUTE
C383C281 (C381) LATIN CAPITAL LETTER A WITH ACUTE
C383E280B0 (C389) LATIN CAPITAL LETTER E WITH ACUTE
C383C28D (C38D) LATIN CAPITAL LETTER I WITH ACUTE
C383E2809C (C393) LATIN CAPITAL LETTER O WITH ACUTE
C383E28093 (C396) LATIN CAPITAL LETTER O WITH DIAERESIS
C385C290 (C590) LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
C383C5A1 (C39A) LATIN CAPITAL LETTER U WITH ACUTE
C383C593 (C39C) LATIN CAPITAL LETTER U WITH DIAERESIS
C385C2B0 (C5B0) LATIN CAPITAL LETTER U WITH DOUBLE ACUTE

The letters should be consumed just 2 bytes, but in mysql all of them 
uses 4 or 5(!!!) bytes. Does MySql 5.0 store the UTF8 codes not 
directly? Since Bugzilla displays the entered bug correctly (it reads 
the vowels properly from database), I think there must be some rule in 
mysql storing utf8 codes. When I make a (mysql)dump from the database, 
also these long codes are inside of the dump.


The old dumps were made still with mysql-essential-4.1.12-win32.msi and 
they contain proper UTF8 codes (with 2 bytes length per hungarian 
special vowel), so I do not know why MySQL 5.0 stores these vowels 
differently than MySQL 4.1.


Regards
Zsolt


I think I know the cause of the problem. Bugzilla somehow encodes (and 
decodes) twice the characters.

E.g. E1 (LATIN SMALL LETTER A WITH ACUTE) - C3 A1 - C3 83 C2 A1
So not MySQL is repsonsible for the strange phenomenon, I'm sorry for my 
 previous post.


Zsolt

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]