Hi,

I'm having trouble storing certain UTF-8 characters in MySQL, for
example U+10330 (UTF-8: 0xF0 0x90 0x8C 0xB0).
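
As a sanity check on the byte sequence quoted above, here is a minimal
sketch in Python (used only for illustration; it is not part of the MySQL
setup described in this message). It encodes U+10330 and checks that the
four bytes decode back under a strict UTF-8 decoder:

```python
# Encode U+10330 as UTF-8 and compare with the bytes quoted in the message.
ch = "\U00010330"
encoded = ch.encode("utf-8")
expected = bytes([0xF0, 0x90, 0x8C, 0xB0])

# Strict decoding raises UnicodeDecodeError if the bytes are malformed.
decoded = expected.decode("utf-8", errors="strict")

print(encoded.hex())
print(encoded == expected, decoded == ch, hex(ord(decoded)))
```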

Background: I am trying to clone Wiktionary onto local intranets in a
series of (disconnected) schools in Nepal. I ran into these problems
while importing the full Wiktionary database dump, but have narrowed
them down to the simple test case below.

I am using MySQL-5.0.77 client and server on Linux. I know these kinds
of problems are commonly user errors, but I think I've covered all the
bases.

First, my command line environment:
        # locale
        LANG=en_US.UTF-8
        LC_CTYPE="en_US.UTF-8"
        LC_NUMERIC="en_US.UTF-8"
        LC_TIME="en_US.UTF-8"
        LC_COLLATE="en_US.UTF-8"
        LC_MONETARY="en_US.UTF-8"
        LC_MESSAGES="en_US.UTF-8"
        LC_PAPER="en_US.UTF-8"
        LC_NAME="en_US.UTF-8"
        LC_ADDRESS="en_US.UTF-8"
        LC_TELEPHONE="en_US.UTF-8"
        LC_MEASUREMENT="en_US.UTF-8"
        LC_IDENTIFICATION="en_US.UTF-8"

Everything is UTF-8.

Relevant snippets of my my.cnf:
        [mysqld]
        character_set_server=utf8

        [mysql]
        default-character-set=utf8


Inside the mysql client:
        mysql> SHOW VARIABLES LIKE "character\_set\_%";
        +--------------------------+--------+
        | Variable_name            | Value  |
        +--------------------------+--------+
        | character_set_client     | utf8   |
        | character_set_connection | utf8   |
        | character_set_database   | utf8   |
        | character_set_filesystem | binary |
        | character_set_results    | utf8   |
        | character_set_server     | utf8   |
        | character_set_system     | utf8   |
        +--------------------------+--------+


Hopefully I have convinced you that I am running in a true UTF-8
environment. Now onto the issue: in the above environment I run:

CREATE TABLE dsd (
  `page_id` int(10) unsigned NOT NULL auto_increment,
  `page_title` varchar(255) character set utf8 collate utf8_bin NOT NULL,
  PRIMARY KEY  (`page_id`),
  UNIQUE KEY `name_title` (`page_title`)
);

Now I insert one record with a known-working UTF-8 character:

INSERT INTO dsd (page_title) VALUES  (0xc2a3);

This is the UK pound sign: £
http://www.fileformat.info/info/unicode/char/00a3/index.htm

Running a SELECT statement shows that this was inserted just fine.


Now the problematic character:

INSERT INTO dsd (page_title) VALUES (0xf0908cb0);

This is U+10330 (GOTHIC LETTER AHSA):
http://www.fileformat.info/info/unicode/char/10330/index.htm

This gives me the warning:
Warning (Code 1366): Incorrect string value: '\xF0\x90\x8C\xB0' for
column 'page_title' at row 1

and results in a zero-length string being inserted instead.

Can anyone else reproduce this? This is definitely a valid UTF-8
sequence, so why is MySQL rejecting it?
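
One visible difference between the two test values is how many bytes they
occupy and whether the code point lies above U+FFFF. The sketch below, in
Python, decodes both hex literals from the INSERT statements and reports
this. The helper name `describe` is made up for this illustration, and
nothing here shows on its own that this difference is the cause of the
MySQL warning:

```python
def describe(raw: bytes) -> dict:
    """Decode one UTF-8 encoded character and summarise it."""
    text = raw.decode("utf-8")  # strict by default: raises on invalid input
    cp = ord(text)              # requires exactly one decoded character
    return {
        "code_point": f"U+{cp:04X}",
        "utf8_bytes": len(raw),
        "above_ffff": cp > 0xFFFF,
    }

pound = describe(bytes.fromhex("c2a3"))       # the value that inserted fine
gothic = describe(bytes.fromhex("f0908cb0"))  # the value that was rejected

print(pound)
print(gothic)
```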

The same happens if I input the character directly (rather than using
the hex representation) and also if I input that character directly
from a UTF-8 text file. Any ideas?
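
To see which lines of a UTF-8 text file contain characters like this one
(code points above U+FFFF), something along the following lines might
help. This is only a sketch: the function name `find_supplementary` and
the sample data are hypothetical, and it assumes the input has already
been decoded as valid UTF-8.

```python
from typing import Iterable, List, Tuple

def find_supplementary(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, code point) for each character above U+FFFF."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for ch in line:
            if ord(ch) > 0xFFFF:
                hits.append((lineno, f"U+{ord(ch):04X}"))
    return hits

# Hypothetical sample: one line with the pound sign, one with U+10330.
sample = ["pound \u00a3", "gothic \U00010330"]
hits = find_supplementary(sample)
print(hits)
```

For a real file, an object returned by open(path, encoding="utf-8") can be
passed in place of the sample list, since it yields decoded lines.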

Thanks,
Daniel
