Handling special characters
This isn't really a problem with MySQL; MySQL just trips on it. Let me caution you that this discussion applies to Windows. I did observe this in my applications on Linux, but the problem will show itself only when reading a file, whether uploaded or not; there will be no problem if the data is entered into a form in a web browser. I did most of my testing with 0x96, the Windows n-dash. - Trying to INSERT a funky character will give you a data truncated message from a 4.x server, or a complaint about the character not being valid in the target character set from a 5.x server. The error message from the 5.x server will point you directly to the problem character. - The ultimate cause of the problem is that Windows applications use the CP-1252 character set. It is similar to, but not identical to, UTF-8. - MySQL does not recognize the CP-1252 character set, at least not out of the box. Therefore you cannot get MySQL to convert CP-1252 characters to UTF-8. You could try CP-1251, I didn't test that. - Sourcing a file containing one of these troublesome characters into the MySQL CLI will trigger the problem, because the data is sucked in as-is. - Copying and pasting into the CLI will sometimes avoid the problem because Windows automatically transliterates some characters during that operation. - Windows does NOT TRANSLATE the characters. Note that I said transliterates. The 0x96 character, the CP-1252 n-dash, is silently changed to 0x2D: a hyphen when you paste it into a console application. - The reason that data entered into a web form works is that IE (and I suppose other browsers) maps the CP-1252 characters into their corresponding UTF-8 characters. Thus 0x96, when pasted into a browser, turns into 0xe28093 when pushed back to the server. That three-byte sequence is the UTF-8 n-dash character. What we have chosen to do is to transliterate the most common of these troublesome characters ourselves. In PHP, when working with a file it looks like this: $cp1252_special_chars = array( \x96 = -, \x97 = --, \x91 = \\', \x92 = \\', \x85 = ..., \x93 = \, \x94 = \ ); On the web server side (UTF-8) it looks like this: $utf_special_chars = array( \xe2\x80\x93 = -, \xe2\x80\x94 = --, \xe2\x80\x98 = \\', \xe2\x80\x99 = \\', \xe2\x80\xa6 = ..., \xe2\x80\x9c = \, \xe2\x80\x9d = \ ); This is not a perfect solution for everyone, but it suits our needs. The data we work with comes in higgledy-piggledy, and we want it consistent whether someone sends us a data file or pastes a string into one of our forms. If somebody (not I, given my level of knowledge and available time) adds CP-1252 to the character sets available with MySQL, then you could use CONVERT() with USING to handle this (assuming that you knew ahead of time where your data was coming from). I hope this helps, I grew a few rings and shed a lot of bark to get here. Regards, Jerry Schwartz The Infoshop by Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 www.the-infoshop.com www.giiexpress.com www.etudes-marche.com -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
RE: Handling Special Characters
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Baron Schwartz Sent: Friday, January 18, 2008 2:33 PM To: Jerry Schwartz Cc: mysql@lists.mysql.com Subject: Re: Handling Special Characters Jerry, On Jan 18, 2008 2:27 PM, Jerry Schwartz [EMAIL PROTECTED] wrote: I am having trouble inserting special characters into a table. I am using the MySQL client. I put the following commands into a text file (I'm on WinXP, using Notepad), copy them, and paste them into the MySQL command line client. SET NAMES utf8; CREATE TEMPORARY TABLE `giiexpr_db`.`eo_name_table` ( `eo_name` VARCHAR( 255 ) NOT NULL ) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci; INSERT INTO eo_name_table (eo_name) VALUES (Associated British Foods Plc (Abf) - Hot Drinks - World); SELECT * FROM eo_name_table; +-+ | eo_name | +-+ | Associated British Foods Plc (Abf) - Hot Drinks - World | +-+ 1 row in set (0.04 sec) That symbol before World is an N-dash, 0x96. This works perfectly. Here's where things go wrong. If instead of pasting these commands into the client, I source the exact same file, I get this: +-+ | eo_name | +-+ | Associated British Foods Plc (Abf) - Hot Drinks | +-+ 1 row in set (0.04 sec) It seems that special characters are treated differently when they are in a sourced file. It isn't just the N-dash, French accented characters cause the same problem. These are all single-byte characters. Just to be sure, I checked the length of the string in the field, and it accurately reflected the difference. That proves, to my satisfaction, that the problem is on the input side. I'm using version 4.1.22 community server on a Linux platform. I'm using version 5.0.45 of the CLI client. Anyone have any ideas? I was trying to avoid having to write a program to do this. After you get the data you want into the table via copy/paste, can you dump and reload it correctly with mysqldump dump.sql and mysql dump.sql? If so, try peeking into the dump file and see how it's written there. [JS] It turns out that the actual copy/paste on the Windows side wasn't giving me the proper results either. It was not obvious with a visual inspection, but the paste into a command window was apparently converting the 0x96 (an N-dash) to 0x20 (a regular hyphen). That strikes me as a little weird. If I work on the Linux side, I get better results. I brought my input file over to Linux, then sourced it into MySQL CLI there. The correct data went into the data base. I dumped it out on the Linux side and brought the file back to my PC. Looking at the dumped file with an editor that displays UTF-8 shows that the output of mysqldump was correct. That gave me a clue. It appears that all of my problems are on the Windows side. When I forced my input file to be saved using UTF-8 encoding, I got the right results. This explains why I've been having trouble entering Chinese characters into my database. (This N-quad business was just the tip of a very large iceberg.) I had discovered that copying from a UTF-8 encoded web page directly into Excel, and then going through some gyrations on the PC end, worked; but pasting it directly into an editor did not. I'll bet if I force UTF-8 encoding when I save the file, I'll have better luck. Thanks for the hint. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
RE: Handling Special Characters
use UTF-8 for text files use SET NAMEs to correctly MySQL what charset you are using -- Sebastian [JS] You are correct. The results were very confusing, especially since the visual difference between a hyphen (0x2D) and an N-quad (0x96). Also, copying UTF-8 text to a command window seems to do some automatic conversions. Whenever I deal with other characters, I get a brain cramp. You should see me working with Chinese. Thanks. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
Re: Handling Special Characters
Jerry Schwartz schrieb: I am having trouble inserting special characters into a table. I am using the MySQL client. I put the following commands into a text file (I'm on WinXP, using Notepad), copy them, and paste them into the MySQL command line client. SET NAMES utf8; CREATE TEMPORARY TABLE `giiexpr_db`.`eo_name_table` ( `eo_name` VARCHAR( 255 ) NOT NULL ) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci; INSERT INTO eo_name_table (eo_name) VALUES (Associated British Foods Plc (Abf) - Hot Drinks - World); SELECT * FROM eo_name_table; +-+ | eo_name | +-+ | Associated British Foods Plc (Abf) - Hot Drinks - World | +-+ 1 row in set (0.04 sec) That symbol before World is an N-dash, 0x96. This works perfectly. Here's where things go wrong. If instead of pasting these commands into the client, I source the exact same file, I get this: you need to take care of the used charset with the text-file [...] Anyone have any ideas? I was trying to avoid having to write a program to do this. use UTF-8 for text files use SET NAMEs to correctly MySQL what charset you are using -- Sebastian -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
RE: Handling Special Characters
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Baron Schwartz Sent: Friday, January 18, 2008 2:33 PM To: Jerry Schwartz Cc: mysql@lists.mysql.com Subject: Re: Handling Special Characters Jerry, On Jan 18, 2008 2:27 PM, Jerry Schwartz [EMAIL PROTECTED] wrote: I am having trouble inserting special characters into a table. I am using the MySQL client. I put the following commands into a text file (I'm on WinXP, using Notepad), copy them, and paste them into the MySQL command line client. SET NAMES utf8; CREATE TEMPORARY TABLE `giiexpr_db`.`eo_name_table` ( `eo_name` VARCHAR( 255 ) NOT NULL ) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci; INSERT INTO eo_name_table (eo_name) VALUES (Associated British Foods Plc (Abf) - Hot Drinks - World); SELECT * FROM eo_name_table; +-+ | eo_name | +-+ | Associated British Foods Plc (Abf) - Hot Drinks - World | +-+ 1 row in set (0.04 sec) That symbol before World is an N-dash, 0x96. This works perfectly. Here's where things go wrong. If instead of pasting these commands into the client, I source the exact same file, I get this: +-+ | eo_name | +-+ | Associated British Foods Plc (Abf) - Hot Drinks | +-+ 1 row in set (0.04 sec) It seems that special characters are treated differently when they are in a sourced file. It isn't just the N-dash, French accented characters cause the same problem. These are all single-byte characters. Just to be sure, I checked the length of the string in the field, and it accurately reflected the difference. That proves, to my satisfaction, that the problem is on the input side. I'm using version 4.1.22 community server on a Linux platform. I'm using version 5.0.45 of the CLI client. Anyone have any ideas? I was trying to avoid having to write a program to do this. After you get the data you want into the table via copy/paste, can you dump and reload it correctly with mysqldump dump.sql and mysql dump.sql? If so, try peeking into the dump file and see how it's written there. [JS] Just to make things more complicated, I've discovered that pasting in 0x96 works fine, pasting in an e-acute does not. I'll try your suggestion. This has been driving me nuts. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
Re: Handling Special Characters
Jerry, On Jan 18, 2008 2:27 PM, Jerry Schwartz [EMAIL PROTECTED] wrote: I am having trouble inserting special characters into a table. I am using the MySQL client. I put the following commands into a text file (I'm on WinXP, using Notepad), copy them, and paste them into the MySQL command line client. SET NAMES utf8; CREATE TEMPORARY TABLE `giiexpr_db`.`eo_name_table` ( `eo_name` VARCHAR( 255 ) NOT NULL ) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci; INSERT INTO eo_name_table (eo_name) VALUES (Associated British Foods Plc (Abf) - Hot Drinks - World); SELECT * FROM eo_name_table; +-+ | eo_name | +-+ | Associated British Foods Plc (Abf) - Hot Drinks - World | +-+ 1 row in set (0.04 sec) That symbol before World is an N-dash, 0x96. This works perfectly. Here's where things go wrong. If instead of pasting these commands into the client, I source the exact same file, I get this: +-+ | eo_name | +-+ | Associated British Foods Plc (Abf) - Hot Drinks | +-+ 1 row in set (0.04 sec) It seems that special characters are treated differently when they are in a sourced file. It isn't just the N-dash, French accented characters cause the same problem. These are all single-byte characters. Just to be sure, I checked the length of the string in the field, and it accurately reflected the difference. That proves, to my satisfaction, that the problem is on the input side. I'm using version 4.1.22 community server on a Linux platform. I'm using version 5.0.45 of the CLI client. Anyone have any ideas? I was trying to avoid having to write a program to do this. After you get the data you want into the table via copy/paste, can you dump and reload it correctly with mysqldump dump.sql and mysql dump.sql? If so, try peeking into the dump file and see how it's written there. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
Handling Special Characters
I am having trouble inserting special characters into a table. I am using the MySQL client. I put the following commands into a text file (I'm on WinXP, using Notepad), copy them, and paste them into the MySQL command line client. SET NAMES utf8; CREATE TEMPORARY TABLE `giiexpr_db`.`eo_name_table` ( `eo_name` VARCHAR( 255 ) NOT NULL ) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci; INSERT INTO eo_name_table (eo_name) VALUES (Associated British Foods Plc (Abf) - Hot Drinks - World); SELECT * FROM eo_name_table; +-+ | eo_name | +-+ | Associated British Foods Plc (Abf) - Hot Drinks - World | +-+ 1 row in set (0.04 sec) That symbol before World is an N-dash, 0x96. This works perfectly. Here's where things go wrong. If instead of pasting these commands into the client, I source the exact same file, I get this: +-+ | eo_name | +-+ | Associated British Foods Plc (Abf) - Hot Drinks | +-+ 1 row in set (0.04 sec) It seems that special characters are treated differently when they are in a sourced file. It isn't just the N-dash, French accented characters cause the same problem. These are all single-byte characters. Just to be sure, I checked the length of the string in the field, and it accurately reflected the difference. That proves, to my satisfaction, that the problem is on the input side. I'm using version 4.1.22 community server on a Linux platform. I'm using version 5.0.45 of the CLI client. Anyone have any ideas? I was trying to avoid having to write a program to do this. Regards, Jerry Schwartz The Infoshop by Global Information Incorporated 195 Farmington Ave. Farmington, CT 06032 860.674.8796 / FAX: 860.674.8341 www.the-infoshop.com www.giiexpress.com www.etudes-marche.com -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]