>-----Original Message-----
>From: Tim McDaniel [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, April 16, 2008 2:32 PM
>Cc: 'Mysql'
>Subject: RE: \x96 in column value?
>
>On Wed, 16 Apr 2008, Jerry Schwartz <[EMAIL PROTECTED]> wrote:
>> I'm running afoul of the UTF8 character set somehow:
>>
>> mysql> select convert(char(0x96) using utf8);
>> +----------------------------------+
>> | convert(char(0x96) using utf8)   |
>> +----------------------------------+
>> | NULL                             |
>> +----------------------------------+
>> 1 row in set, 1 warning (0.00 sec)
>>
>> mysql> show warnings;
>> +-------+------+-------------------------------------+
>> | Level | Code | Message                             |
>> +-------+------+-------------------------------------+
>> | Error | 1300 | Invalid utf8 character string: '96' |
>> +-------+------+-------------------------------------+
>> 1 row in set (0.00 sec)
>>
>> On top of my other problems, I've discovered that pasting the UTF8
>> character represented by 0x96 into the MySQL CLI (Windows) somehow
>> converts the character to 0x2D (a normal dash); so a lot of my
>> testing has been wasted.  Pasting it into a Windows-based editor
>> preserves the character as 0x96.
>
>In an earlier note, he wrote
>> You may not be able to see it, but that is actually an n-dash
>> (\x96).
>
>Actually, \x96 is not an en-dash.
><http://www.unicode.org/charts/PDF/U0080.pdf> says that it's
>"START OF GUARDED AREA".  x96 is in the middle of a block of control
>characters from the unnamed control character at \x80 through
>APPLICATION PROGRAM COMMAND at \x9F (or arguably NO-BREAK SPACE at
>\xa0).

[JS] Right you are. This whole business gives me an extreme headache. When
working in PHP, I assume my Windows-generated input is cp1252 and I convert
that to UTF-8. Aside from that, we always work in UTF-8 (database and web)
because I have to handle Chinese. (I have no idea if I'm doing that right; I
can't read the results. ;<)

In Microsoft's code page 1252, 0x96 is indeed an en-dash (the character that
Unicode calls U+2013). I think this might be my clue.
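
To make the clue concrete, here is a minimal Python sketch (Python only
because it is handy for poking at bytes; none of this is from MySQL or PHP).
It shows that the single byte 0x96 is an en-dash under cp1252 but is not a
valid UTF-8 sequence on its own, which is presumably why
convert(char(0x96) using utf8) comes back NULL with error 1300. The variable
names are just for illustration.

```python
# Sketch: how the single byte 0x96 behaves under cp1252 versus UTF-8.
raw = b"\x96"

# In Windows code page 1252, 0x96 maps to U+2013 (EN DASH).
as_cp1252 = raw.decode("cp1252")

# Re-encoded as UTF-8, the en-dash becomes three bytes: E2 80 93.
as_utf8 = as_cp1252.encode("utf-8")

# On its own, 0x96 is a UTF-8 continuation byte, so it is not valid UTF-8.
try:
    raw.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(hex(ord(as_cp1252)), as_utf8.hex(), lone_byte_is_valid_utf8)
```

So if the input really is cp1252, it has to be decoded as cp1252 and
re-encoded before it is handed to anything that expects UTF-8.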

Although our web pages specify UTF-8, I found an article in MSDN that seems
to say that IE interprets UTF-8 pages using a code page in the cp1200
"family", whatever that means. That must be why our data looks correct going
end-to-end.

I also found http://effbot.org/zone/unicode-gremlins.htm, which gives a bit
of Python code to translate some cp1252 bytes to their Unicode equivalents,
along with a nice list of the problem characters. The PHP documentation for
the iconv() function has examples as well, but one of the comments there says
that 0x96 breaks iconv.
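
The general idea of that kind of repair can be sketched as follows. This is
my own rough version, not the code from the effbot page: it assumes the text
has already been decoded to a Unicode string in which cp1252 bytes ended up
as C1 control characters (U+0080 through U+009F), and the function name
fix_cp1252_gremlins is made up for this example.

```python
def fix_cp1252_gremlins(text: str) -> str:
    """Replace C1 control characters (U+0080..U+009F) with the characters
    that cp1252 assigns to the same byte values, where it defines one."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            try:
                ch = bytes([cp]).decode("cp1252")
            except UnicodeDecodeError:
                # A few byte values have no mapping in Python's cp1252
                # codec; those characters are left unchanged.
                pass
        out.append(ch)
    return "".join(out)


# A stray U+0096 becomes a real en-dash (U+2013); other text is untouched.
print(fix_cp1252_gremlins("pages 10\x9620") == "pages 10\u201320")
```

Anything outside the U+0080..U+009F range, Chinese included, passes through
untouched, so this should be safe to run over text that is otherwise
correct UTF-8 once decoded.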

I need to chew on this some more. PHP's core string functions aren't really
multi-byte aware, and native Unicode support isn't expected until 6.x.

>
>Microsoft, in some of their Windows code pages, assigned meanings to
>those values that differ from the Unicode and ISO-8859-1 standards
>(quelle surprise), assigning many of them uses as printable characters.
>I think it's the Windows 1250 code page, at
><http://www.microsoft.com/globaldev/reference/sbcs/1250.mspx>.
>As that page and
><http://www.microsoft.com/typography/developers/fdsspec/punc2.htm>
>note, the Unicode standard value for an en-dash is U+2013 (which
>appears to be in hex).
>
>As to whether this affects the problem I don't know.  Since x96 is a
>valid character, whether Microsoft or real Unicode, I would not expect
>it to be a problem per se.  I just wanted to point out what it might
>not mean.
>
>--
>Tim McDaniel, [EMAIL PROTECTED]
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]
>infoshop.com




