>-----Original Message-----
>From: Tim McDaniel [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, April 16, 2008 2:32 PM
>Cc: 'Mysql'
>Subject: RE: \x96 in column value?
>
>On Wed, 16 Apr 2008, Jerry Schwartz <[EMAIL PROTECTED]> wrote:
>> I'm running afoul of the UTF8 character set somehow:
>>
>> mysql> select convert(char(0x96) using utf8);
>> +----------------------------------+
>> | convert(char(0x96) using utf8)   |
>> +----------------------------------+
>> | NULL                             |
>> +----------------------------------+
>> 1 row in set, 1 warning (0.00 sec)
>>
>> mysql> show warnings;
>> +-------+------+-------------------------------------+
>> | Level | Code | Message                             |
>> +-------+------+-------------------------------------+
>> | Error | 1300 | Invalid utf8 character string: '96' |
>> +-------+------+-------------------------------------+
>> 1 row in set (0.00 sec)
>>
>> On top of my other problems, I've discovered that pasting the UTF8
>> character represented by 0x96 into the MySQL CLI (Windows) somehow
>> converts the character to 0x2D (a normal dash); so a lot of my
>> testing has been wasted. Pasting it into a Windows-based editor
>> preserves the character as 0x96.
>
>In an earlier note, he wrote
>> You may not be able to see it, but that is actually an n-dash
>> (\x96).
>
>Actually, \x96 is not an en-dash.
><http://www.unicode.org/charts/PDF/U0080.pdf> says that it's
>"START OF GUARDED AREA". x96 is in the middle of a block of control
>characters from the unnamed control character at \x80 through
>APPLICATION PROGRAM COMMAND at \x9F (or arguably NO-BREAK SPACE at
>\xa0).
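
A minimal Python sketch of why the conversion quoted above might come back NULL, assuming MySQL's utf8 conversion rejects the byte for the same reason any strict UTF-8 decoder would. The variable names are only illustrative; nothing here comes from the original thread.

```python
import unicodedata

lone = bytes([0x96])

# 0x96 is 0b10010110. The leading bits 10 mark a UTF-8 continuation byte,
# so it cannot stand alone as a character.
try:
    lone.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# The code point U+0096 is a C1 control character (category "Cc").
# Encoded as UTF-8 it takes two bytes: C2 96.
control = "\u0096"
control_category = unicodedata.category(control)
control_utf8 = control.encode("utf-8")

# The en dash is U+2013. Encoded as UTF-8 it takes three bytes: E2 80 93.
en_dash_utf8 = "\u2013".encode("utf-8")

print(decoded_ok, control_category, control_utf8.hex(), en_dash_utf8.hex())
```

So a single 0x96 byte is neither the control character nor the en dash once the data is treated as UTF-8; it only means something in a single-byte encoding.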
[JS] Right you are. This whole business gives me an extreme headache. When
working in PHP, I assume my Windows-generated input is cp1252 and I convert
that to UTF-8. Aside from that, we always work in UTF-8 (database and web)
because I have to handle Chinese. (I have no idea if I'm doing that right, I
can't read the results. ;<)

In Microsoft's code page 1252, 0x96 is indeed an n-dash. I think this might
be my clue. Although our web pages specify UTF-8, I found an article in MSDN
that seems to say that IE interprets UTF-8 pages using a code page in the
cp1200 "family", whatever that means. That must be why our data looks
correct going end-to-end.

I also found http://effbot.org/zone/unicode-gremlins.htm, which gives a bit
of Python code to translate some cp1252 bits to their Unicode equivalents. It
also gives you a nice list of the problem characters.

There are also examples in the PHP documentation of the iconv() function, but
there is also a comment that 0x96 breaks iconv. I need to chew on this some
more. PHP doesn't really handle multi-byte characters until 6.x.

>
>Microsoft, in some of their Windows code pages, assigned meanings to
>those values that differ from the Unicode and ISO-8859-1 standards
>(quelle surprise), assigning many of them uses as printable characters.
>I think it's the Windows 1250 code page, at
><http://www.microsoft.com/globaldev/reference/sbcs/1250.mspx>.
>As that page and
><http://www.microsoft.com/typography/developers/fdsspec/punc2.htm>
>note, the Unicode standard value for an en-dash is U+2013 (which
>appears to be in hex).
>
>As to whether this affects the problem I don't know. Since x96 is a
>valid character, whether Microsoft or real Unicode, I would not expect
>it to be a problem per se. I just wanted to point out what it might
>not mean.
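
The "assume cp1252, convert to UTF-8" step described above can be sketched in Python roughly as follows. This is only an illustration in the spirit of the effbot article, not its code; the helper name fix_cp1252_gremlins is hypothetical, and it assumes the stray characters arrived as C1 control code points (U+0080 to U+009F) after the bytes were mis-read as ISO-8859-1.

```python
def fix_cp1252_gremlins(text: str) -> str:
    """Replace C1 control characters with the characters that cp1252
    assigns to the same byte values, where cp1252 defines one."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes (e.g. 0x81) are unassigned in Python's cp1252
                # codec; leave those characters untouched.
                pass
        out.append(ch)
    return "".join(out)


# Direct route when the raw bytes are still available: decode as cp1252,
# then encode as UTF-8. Byte 0x96 becomes U+2013, i.e. bytes E2 80 93.
raw = b"\x96"
raw_utf8 = raw.decode("cp1252").encode("utf-8")

# Repair route when the text was already decoded as ISO-8859-1.
repaired = fix_cp1252_gremlins("pages 10\x9620")

print(raw_utf8.hex(), repaired == "pages 10\u201320")
```

Whichever route applies, the result has to be stored and sent as UTF-8 all the way through; a bare 0x96 byte in a utf8 column or connection will still be rejected.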
>
>--
>Tim McDaniel, [EMAIL PROTECTED]
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]
>infoshop.com