Handling special characters

2008-04-18 Thread Jerry Schwartz
This isn't really a problem with MySQL; MySQL just trips on it.

Let me caution you that this discussion applies to Windows. I did observe
this in my applications on Linux, but the problem will show itself only when
reading a file, whether uploaded or not; there will be no problem if the
data is entered into a form in a web browser. I did most of my testing with
0x96, the Windows n-dash.

- Trying to INSERT a funky character will give you a data truncated
message from a 4.x server, or a complaint about the character not being
valid in the target character set from a 5.x server. The error message from
the 5.x server will point you directly to the problem character.

- The ultimate cause of the problem is that Windows applications use the
CP-1252 character set. It is similar to, but not identical to, UTF-8.

- MySQL does not recognize the CP-1252 character set, at least not out of
the box. Therefore you cannot get MySQL to convert CP-1252 characters to
UTF-8. You could try CP-1251, I didn't test that.

- Sourcing a file containing one of these troublesome characters into the
MySQL CLI will trigger the problem, because the data is sucked in as-is.

- Copying and pasting into the CLI will sometimes avoid the problem because
Windows automatically transliterates some characters during that operation.

- Windows does NOT TRANSLATE the characters. Note that I said
transliterates. The 0x96 character, the CP-1252 n-dash, is silently
changed to 0x2D: a hyphen when you paste it into a console application.

- The reason that data entered into a web form works is that IE (and I
suppose other browsers) maps the CP-1252 characters into their corresponding
UTF-8 characters. Thus 0x96, when pasted into a browser, turns into 0xe28093
when pushed back to the server. That three-byte sequence is the UTF-8 n-dash
character.

What we have chosen to do is to transliterate the most common of these
troublesome characters ourselves. In PHP, when working with a file it looks
like this:

$cp1252_special_chars = array(
\x96 = -,
\x97 = --,
\x91 = \\',
\x92 = \\',
\x85 = ...,
\x93 = \,
\x94 = \
);

On the web server side (UTF-8) it looks like this:

$utf_special_chars = array(
\xe2\x80\x93 = -,
\xe2\x80\x94 = --,
\xe2\x80\x98 = \\',
\xe2\x80\x99 = \\',
\xe2\x80\xa6 = ...,
\xe2\x80\x9c = \,
\xe2\x80\x9d = \
);

This is not a perfect solution for everyone, but it suits our needs. The
data we work with comes in higgledy-piggledy, and we want it consistent
whether someone sends us a data file or pastes a string into one of our
forms.

If somebody (not I, given my level of knowledge and available time) adds
CP-1252 to the character sets available with MySQL, then you could use
CONVERT() with USING to handle this (assuming that you knew ahead of time
where your data was coming from).

I hope this helps, I grew a few rings and shed a lot of bark to get here.

Regards,

Jerry Schwartz
The Infoshop by Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com
www.giiexpress.com
www.etudes-marche.com





-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



RE: Handling Special Characters

2008-01-21 Thread Jerry Schwartz
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
 Behalf Of Baron Schwartz
 Sent: Friday, January 18, 2008 2:33 PM
 To: Jerry Schwartz
 Cc: mysql@lists.mysql.com
 Subject: Re: Handling Special Characters

 Jerry,

 On Jan 18, 2008 2:27 PM, Jerry Schwartz [EMAIL PROTECTED]
 wrote:
  I am having trouble inserting special characters into a table. I am
 using
  the MySQL client. I put the following commands into a text file (I'm
 on
  WinXP, using Notepad), copy them, and paste them into the MySQL
 command line
  client.
 
  SET NAMES utf8;
 
  CREATE TEMPORARY TABLE `giiexpr_db`.`eo_name_table` (
  `eo_name` VARCHAR( 255 ) NOT NULL
  ) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;
 
  INSERT INTO eo_name_table
  (eo_name)
  VALUES
  (Associated British Foods Plc (Abf) - Hot Drinks - World);
 
  SELECT * FROM eo_name_table;
  +-+
  | eo_name |
  +-+
  | Associated British Foods Plc (Abf) - Hot Drinks - World |
  +-+
  1 row in set (0.04 sec)
 
  That symbol before World is an N-dash, 0x96. This works perfectly.
 
  Here's where things go wrong. If instead of pasting these commands
 into the
  client, I source the exact same file, I get this:
 
  +-+
  | eo_name |
  +-+
  | Associated British Foods Plc (Abf) - Hot Drinks |
  +-+
  1 row in set (0.04 sec)
 
  It seems that special characters are treated differently when they
 are in a
  sourced file. It isn't just the N-dash, French accented characters
 cause the
  same problem. These are all single-byte characters.
 
  Just to be sure, I checked the length of the string in the field, and
 it
  accurately reflected the difference. That proves, to my satisfaction,
 that
  the problem is on the input side.
 
  I'm using version 4.1.22 community server on a Linux platform. I'm
 using
  version 5.0.45 of the CLI client.
 
  Anyone have any ideas? I was trying to avoid having to write a
 program to do
  this.

 After you get the data you want into the table via copy/paste, can you
 dump and reload it correctly with mysqldump  dump.sql and mysql 
 dump.sql?  If so, try peeking into the dump file and see how it's
 written there.
[JS] It turns out that the actual copy/paste on the Windows side wasn't
giving me the proper results either. It was not obvious with a visual
inspection, but the paste into a command window was apparently converting
the 0x96 (an N-dash) to 0x20 (a regular hyphen). That strikes me as a little
weird.

If I work on the Linux side, I get better results. I brought my input file
over to Linux, then sourced it into MySQL CLI there. The correct data went
into the data base. I dumped it out on the Linux side and brought the file
back to my PC. Looking at the dumped file with an editor that displays UTF-8
shows that the output of mysqldump was correct.

That gave me a clue. It appears that all of my problems are on the Windows
side. When I forced my input file to be saved using UTF-8 encoding, I got
the right results.

This explains why I've been having trouble entering Chinese characters into
my database. (This N-quad business was just the tip of a very large
iceberg.) I had discovered that copying from a UTF-8 encoded web page
directly into Excel, and then going through some gyrations on the PC end,
worked; but pasting it directly into an editor did not. I'll bet if I force
UTF-8 encoding when I save the file, I'll have better luck.

Thanks for the hint.




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



RE: Handling Special Characters

2008-01-21 Thread Jerry Schwartz

 use UTF-8 for text files
 use SET NAMEs to correctly MySQL what charset you are using



 --
 Sebastian
[JS] You are correct. The results were very confusing, especially since the
visual difference between a hyphen (0x2D) and an N-quad (0x96). Also,
copying UTF-8 text to a command window seems to do some automatic
conversions.

Whenever I deal with other characters, I get a brain cramp. You should see
me working with Chinese.

Thanks.




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Re: Handling Special Characters

2008-01-20 Thread Sebastian Mendel

Jerry Schwartz schrieb:

I am having trouble inserting special characters into a table. I am using
the MySQL client. I put the following commands into a text file (I'm on
WinXP, using Notepad), copy them, and paste them into the MySQL command line
client.

SET NAMES utf8;

CREATE TEMPORARY TABLE `giiexpr_db`.`eo_name_table` (
`eo_name` VARCHAR( 255 ) NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;

INSERT INTO eo_name_table
(eo_name)
VALUES
(Associated British Foods Plc (Abf) - Hot Drinks - World);

SELECT * FROM eo_name_table;
+-+
| eo_name |
+-+
| Associated British Foods Plc (Abf) - Hot Drinks - World |
+-+
1 row in set (0.04 sec)

That symbol before World is an N-dash, 0x96. This works perfectly.

Here's where things go wrong. If instead of pasting these commands into the
client, I source the exact same file, I get this:


you need to take care of the used charset with the text-file



[...]

Anyone have any ideas? I was trying to avoid having to write a program to do
this.


use UTF-8 for text files
use SET NAMEs to correctly MySQL what charset you are using



--
Sebastian

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



RE: Handling Special Characters

2008-01-19 Thread Jerry Schwartz
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
 Behalf Of Baron Schwartz
 Sent: Friday, January 18, 2008 2:33 PM
 To: Jerry Schwartz
 Cc: mysql@lists.mysql.com
 Subject: Re: Handling Special Characters

 Jerry,

 On Jan 18, 2008 2:27 PM, Jerry Schwartz [EMAIL PROTECTED]
 wrote:
  I am having trouble inserting special characters into a table. I am
 using
  the MySQL client. I put the following commands into a text file (I'm
 on
  WinXP, using Notepad), copy them, and paste them into the MySQL
 command line
  client.
 
  SET NAMES utf8;
 
  CREATE TEMPORARY TABLE `giiexpr_db`.`eo_name_table` (
  `eo_name` VARCHAR( 255 ) NOT NULL
  ) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;
 
  INSERT INTO eo_name_table
  (eo_name)
  VALUES
  (Associated British Foods Plc (Abf) - Hot Drinks - World);
 
  SELECT * FROM eo_name_table;
  +-+
  | eo_name |
  +-+
  | Associated British Foods Plc (Abf) - Hot Drinks - World |
  +-+
  1 row in set (0.04 sec)
 
  That symbol before World is an N-dash, 0x96. This works perfectly.
 
  Here's where things go wrong. If instead of pasting these commands
 into the
  client, I source the exact same file, I get this:
 
  +-+
  | eo_name |
  +-+
  | Associated British Foods Plc (Abf) - Hot Drinks |
  +-+
  1 row in set (0.04 sec)
 
  It seems that special characters are treated differently when they
 are in a
  sourced file. It isn't just the N-dash, French accented characters
 cause the
  same problem. These are all single-byte characters.
 
  Just to be sure, I checked the length of the string in the field, and
 it
  accurately reflected the difference. That proves, to my satisfaction,
 that
  the problem is on the input side.
 
  I'm using version 4.1.22 community server on a Linux platform. I'm
 using
  version 5.0.45 of the CLI client.
 
  Anyone have any ideas? I was trying to avoid having to write a
 program to do
  this.

 After you get the data you want into the table via copy/paste, can you
 dump and reload it correctly with mysqldump  dump.sql and mysql 
 dump.sql?  If so, try peeking into the dump file and see how it's
 written there.
[JS] Just to make things more complicated, I've discovered that pasting in
0x96 works fine, pasting in an e-acute does not.

I'll try your suggestion. This has been driving me nuts.




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Re: Handling Special Characters

2008-01-18 Thread Baron Schwartz
Jerry,

On Jan 18, 2008 2:27 PM, Jerry Schwartz [EMAIL PROTECTED] wrote:
 I am having trouble inserting special characters into a table. I am using
 the MySQL client. I put the following commands into a text file (I'm on
 WinXP, using Notepad), copy them, and paste them into the MySQL command line
 client.

 SET NAMES utf8;

 CREATE TEMPORARY TABLE `giiexpr_db`.`eo_name_table` (
 `eo_name` VARCHAR( 255 ) NOT NULL
 ) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;

 INSERT INTO eo_name_table
 (eo_name)
 VALUES
 (Associated British Foods Plc (Abf) - Hot Drinks - World);

 SELECT * FROM eo_name_table;
 +-+
 | eo_name |
 +-+
 | Associated British Foods Plc (Abf) - Hot Drinks - World |
 +-+
 1 row in set (0.04 sec)

 That symbol before World is an N-dash, 0x96. This works perfectly.

 Here's where things go wrong. If instead of pasting these commands into the
 client, I source the exact same file, I get this:

 +-+
 | eo_name |
 +-+
 | Associated British Foods Plc (Abf) - Hot Drinks |
 +-+
 1 row in set (0.04 sec)

 It seems that special characters are treated differently when they are in a
 sourced file. It isn't just the N-dash, French accented characters cause the
 same problem. These are all single-byte characters.

 Just to be sure, I checked the length of the string in the field, and it
 accurately reflected the difference. That proves, to my satisfaction, that
 the problem is on the input side.

 I'm using version 4.1.22 community server on a Linux platform. I'm using
 version 5.0.45 of the CLI client.

 Anyone have any ideas? I was trying to avoid having to write a program to do
 this.

After you get the data you want into the table via copy/paste, can you
dump and reload it correctly with mysqldump  dump.sql and mysql 
dump.sql?  If so, try peeking into the dump file and see how it's
written there.

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Handling Special Characters

2008-01-18 Thread Jerry Schwartz
I am having trouble inserting special characters into a table. I am using
the MySQL client. I put the following commands into a text file (I'm on
WinXP, using Notepad), copy them, and paste them into the MySQL command line
client.

SET NAMES utf8;

CREATE TEMPORARY TABLE `giiexpr_db`.`eo_name_table` (
`eo_name` VARCHAR( 255 ) NOT NULL
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;

INSERT INTO eo_name_table
(eo_name)
VALUES
(Associated British Foods Plc (Abf) - Hot Drinks - World);

SELECT * FROM eo_name_table;
+-+
| eo_name |
+-+
| Associated British Foods Plc (Abf) - Hot Drinks - World |
+-+
1 row in set (0.04 sec)

That symbol before World is an N-dash, 0x96. This works perfectly.

Here's where things go wrong. If instead of pasting these commands into the
client, I source the exact same file, I get this:

+-+
| eo_name |
+-+
| Associated British Foods Plc (Abf) - Hot Drinks |
+-+
1 row in set (0.04 sec)

It seems that special characters are treated differently when they are in a
sourced file. It isn't just the N-dash, French accented characters cause the
same problem. These are all single-byte characters.

Just to be sure, I checked the length of the string in the field, and it
accurately reflected the difference. That proves, to my satisfaction, that
the problem is on the input side.

I'm using version 4.1.22 community server on a Linux platform. I'm using
version 5.0.45 of the CLI client.

Anyone have any ideas? I was trying to avoid having to write a program to do
this.

Regards,

Jerry Schwartz
The Infoshop by Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com
www.giiexpress.com
www.etudes-marche.com




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]