Re: Need Help Converting Character Sets

2012-10-02 Thread hsv
 2012/09/30 11:07 -0700, Mark Phillips 
The data for this table comes from a web page (charet utf8). I copy/paste word 
files into gedit (on linux) and then copy/paste from gedit to a text boxes on 
the web page input form. I had thought I was stripping out all the funky 
characters by using a simple ascii editor like gedit, but obviously not.

After looking at the mysqldump for the table in a hex editor, I discovered I 
have these characters scatter throughout the body and intro columns:
#8220;
#8221; 
#8217;
#8212;
#8230;
#8617;

How do you mean this? Is there an instance of Unicode character
LEFT DOUBLE QUOTATION MARK, or else of the string #8220;?

In any case, this sounds like not an SQL, but general-programming problem; 
furthermore, I suggest that you carefully select a left double quotation mark 
or any of the other punctuation characters (RIGHT DOUBLE QUOTATION MARK, RIGHT 
SINGLE QUOTATION MARK, EM DASH, HORIZONTAL ELLIPSIS, ...) that, it seems, you 
are getting and not liking, and carefully follow your procedure.

It further seems to me that you really do not want that to be UTF-8 string, but 
ASCII, or Latin1.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql



RE: Need Help Converting Character Sets

2012-10-01 Thread Rick James
(Sorry; these are random notes, not well structured for your questions.)

#...; -- This is an html entity.  It is plain ascii, but _represents_ (in an 
HTML contect) some special character.  Some details:
http://htmltimes.com/html-special-characters.php
http://www.dwheeler.com/essays/quotes-in-html.html

That is, $#...; is not a character set issue in MySQL.  Most CHARACTER SETs 
(latin1, utf8, etc) will handle them the same.  And they will continue to 
render on an HTML page as the 'funky characters'.  That's HTML doing the deed 
in the browser.

The ALTERs did nothing because there was nothing to do.

I don't know what process converted from '' to #8220;.

The utf8 equivalent:
http://www.fileformat.info/info/unicode/char/201c/index.htm
It would be hex e2809c (3 bytes).

The correct way is to get the utf8 encoding from the form, and store that 
in a utf8-declared column in the table.

If you already have utf8 bytes in your latin1 fields, it gets messier.

hex 22 is the 'straight' ascii double quote.  8220 and 8221 are the curved 
double quotes.  A common convention is to translate both curved ones to the 
single straight one.  But that would be work for you to explicitly do.

If you are using PHP, see functions htmlentities() and html_entity_decode().



 -Original Message-
 From: Mark Phillips [mailto:m...@phillipsmarketing.biz]
 Sent: Sunday, September 30, 2012 11:08 AM
 To: Rick James
 Cc: h...@tbbs.net; Mysql List
 Subject: Re: Need Help Converting Character Sets
 
 Thanks to all of you for your very helpful suggestions! I was pulled
 off on a different project for a few days, and I am now just getting
 back to this one. Sorry for my absence.
 
 First, some more information:
 
 My table definition:
 CREATE TABLE `Articles` (
 `articleID` int(11) unsigned NOT NULL AUTO_INCREMENT, `title`
 varchar(200) NOT NULL DEFAULT '', `author` varchar(200) NOT NULL
 DEFAULT '', `body` text NOT NULL, `intro` text NOT NULL, `caption` text
 NOT NULL, `credits` text NOT NULL, `articleDate` date NOT NULL DEFAULT
 '-00-00', `imageTitle` varchar(255) NOT NULL DEFAULT '',
 `imageAltText` varchar(255) NOT NULL DEFAULT '', `imageWidth` float NOT
 NULL DEFAULT '1', `imageHeight` float NOT NULL DEFAULT '1', `imageFile`
 varchar(255) NOT NULL DEFAULT '', PRIMARY KEY (`articleID`), FULLTEXT
 KEY `search1` (`title`,`author`,`body`,`caption`,`credits`)
 ) ENGINE=MyISAM AUTO_INCREMENT=89 DEFAULT CHARSET=latin1
 
 First problem - CHARSET should be utf8.
 
 The data for this table comes from a web page (charet utf8). I
 copy/paste word files into gedit (on linux) and then copy/paste from
 gedit to a text boxes on the web page input form. I had thought I was
 stripping out all the funky characters by using a simple ascii editor
 like gedit, but obviously not.
 
 After looking at the mysqldump for the table in a hex editor, I
 discovered I have these characters scatter throughout the body and
 intro columns:
 #8220;
 #8221;
 #8217;
 #8212;
 #8230;
 #8617;
 
 I tried converting the columns into utf8 with the command ALTER TABLE
 table_name CONVERT TO CHARACTER SET utf8; but all the special
 characters are still there.
 
 I tried converting to blob and back to utf8, and that didn't change
 anything. (I had to first drop the fulltext key to convert to blob).
 ALTER TABLE t1 CHANGE c1 c1 BLOB;
 ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8; The above special
 characters are still in the body column when I view the dump file with
 a hex editor.
 
 Is there a way to replace these special characters with the appropriate
 utf8 characters (eg #8220; - 22 hex) within the text columns body and
 intro columns?
 
 Thanks,
 
 Mark
 
 On Fri, Sep 28, 2012 at 7:59 AM, Rick James rja...@yahoo-inc.com
 wrote:
 
  Thanks for that link!  That's another subtle issue I had not noted.
 
  There are so many combinations, that it is hard to say do this:
  * Incoming bytes are latin1 / utf8 / Microsquish control characters.
  * You do/don't have SET NAMES (or equivalent)
  * The database/table/column is declared latin1/utf8/other.
  * The problem is on ingestion / on retrieval.
 
  The thing mentioned involved 2 steps:
  ALTER TABLE ... MODIFY COLUMN  BINARY (or BLOB);  -- to forget any
  charset knowledge ALTER TABLE ... MODIFY COLUMN  CHARACTER SET ...;
  -- coming from BINARY, this does not check the encoding.
  (sorry, don't have the link handy)
 
   -Original Message-
   From: h...@tbbs.net [mailto:h...@tbbs.net]
   Sent: Thursday, September 27, 2012 2:24 PM
   To: Mark Phillips
   Cc: Mysql List
   Subject: Re: Need Help Converting Character Sets
  
2012/09/24 16:28 -0700, Mark Phillips 
   I have a table, Articles, of news articles (in English) with three
   text columns for the intro, body, and caption. The data came from a
   web page, and the content was cut and pasted from other sources. I
   am finding that there are some non utf-8 characters in these three
   text columns. I would like to (1) convert these text fields

Re: Need Help Converting Character Sets

2012-09-30 Thread Mark Phillips
Thanks to all of you for your very helpful suggestions! I was pulled off on
a different project for a few days, and I am now just getting back to this
one. Sorry for my absence.

First, some more information:

My table definition:
CREATE TABLE `Articles` (
`articleID` int(11) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(200) NOT NULL DEFAULT '',
`author` varchar(200) NOT NULL DEFAULT '',
`body` text NOT NULL,
`intro` text NOT NULL,
`caption` text NOT NULL,
`credits` text NOT NULL,
`articleDate` date NOT NULL DEFAULT '-00-00',
`imageTitle` varchar(255) NOT NULL DEFAULT '',
`imageAltText` varchar(255) NOT NULL DEFAULT '',
`imageWidth` float NOT NULL DEFAULT '1',
`imageHeight` float NOT NULL DEFAULT '1',
`imageFile` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`articleID`),
FULLTEXT KEY `search1` (`title`,`author`,`body`,`caption`,`credits`)
) ENGINE=MyISAM AUTO_INCREMENT=89 DEFAULT CHARSET=latin1

First problem - CHARSET should be utf8.

The data for this table comes from a web page (charet utf8). I copy/paste
word files into gedit (on linux) and then copy/paste from gedit to a text
boxes on the web page input form. I had thought I was stripping out all the
funky characters by using a simple ascii editor like gedit, but obviously
not.

After looking at the mysqldump for the table in a hex editor, I discovered
I have these characters scatter throughout the body and intro columns:
#8220;
#8221;
#8217;
#8212;
#8230;
#8617;

I tried converting the columns into utf8 with the command
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8;
but all the special characters are still there.

I tried converting to blob and back to utf8, and that didn't change
anything. (I had to first drop the fulltext key to convert to blob).
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The above special characters are still in the body column when I view the
dump file with a hex editor.

Is there a way to replace these special characters with the appropriate
utf8 characters (eg #8220; - 22 hex) within the text columns body and
intro columns?

Thanks,

Mark

On Fri, Sep 28, 2012 at 7:59 AM, Rick James rja...@yahoo-inc.com wrote:

 Thanks for that link!  That's another subtle issue I had not noted.

 There are so many combinations, that it is hard to say do this:
 * Incoming bytes are latin1 / utf8 / Microsquish control characters.
 * You do/don't have SET NAMES (or equivalent)
 * The database/table/column is declared latin1/utf8/other.
 * The problem is on ingestion / on retrieval.

 The thing mentioned involved 2 steps:
 ALTER TABLE ... MODIFY COLUMN  BINARY (or BLOB);  -- to forget any charset
 knowledge
 ALTER TABLE ... MODIFY COLUMN  CHARACTER SET ...;  -- coming from BINARY,
 this does not check the encoding.
 (sorry, don't have the link handy)

  -Original Message-
  From: h...@tbbs.net [mailto:h...@tbbs.net]
  Sent: Thursday, September 27, 2012 2:24 PM
  To: Mark Phillips
  Cc: Mysql List
  Subject: Re: Need Help Converting Character Sets
 
   2012/09/24 16:28 -0700, Mark Phillips 
  I have a table, Articles, of news articles (in English) with three text
  columns for the intro, body, and caption. The data came from a web
  page, and the content was cut and pasted from other sources. I am
  finding that there are some non utf-8 characters in these three text
  columns. I would like to (1) convert these text fields to be strict
  utf-8 and then (2) fix the input page to keep all new submissions utf-
  8.
 
  91) For the first step, fixing the current database, I tried:
 
  update Articles set body = CONVERT(body USING ASCII);
 
  However, when I checked one of the articles I found an apostrophe had
  been converted into a question mark. (FWIW, the apostrophe was one of
  those offending non utf-8 characters):
 
  Before conversion: I stepped into the observatory?s control room ...
 
  After conversion: I stepped into the observatory?s control room...
 
  Is there a better way to accomplish my first goal, without reading each
  article and manually making the changes?
  
  I do not remember where on the MySQL website this is, but there was an
  article about converting from character sets in version 4 to those in
  version 5, when UTF-8 first was supported. It sounds to me that maybe
  the tricks shown there would be useful to you, since, in effect,
  through MySQL MySQL was fooled into accepting for UTF-8 that which was
  not. Conversion to binary string was mentioned.
 
 
  --
  MySQL General Mailing List
  For list archives: http://lists.mysql.com/mysql
  To unsubscribe:http://lists.mysql.com/mysql




RE: Need Help Converting Character Sets

2012-09-28 Thread Rick James
Thanks for that link!  That's another subtle issue I had not noted.

There are so many combinations, that it is hard to say do this:
* Incoming bytes are latin1 / utf8 / Microsquish control characters.
* You do/don't have SET NAMES (or equivalent)
* The database/table/column is declared latin1/utf8/other.
* The problem is on ingestion / on retrieval.

The thing mentioned involved 2 steps:
ALTER TABLE ... MODIFY COLUMN  BINARY (or BLOB);  -- to forget any charset 
knowledge
ALTER TABLE ... MODIFY COLUMN  CHARACTER SET ...;  -- coming from BINARY, this 
does not check the encoding.
(sorry, don't have the link handy)

 -Original Message-
 From: h...@tbbs.net [mailto:h...@tbbs.net]
 Sent: Thursday, September 27, 2012 2:24 PM
 To: Mark Phillips
 Cc: Mysql List
 Subject: Re: Need Help Converting Character Sets
 
  2012/09/24 16:28 -0700, Mark Phillips 
 I have a table, Articles, of news articles (in English) with three text
 columns for the intro, body, and caption. The data came from a web
 page, and the content was cut and pasted from other sources. I am
 finding that there are some non utf-8 characters in these three text
 columns. I would like to (1) convert these text fields to be strict
 utf-8 and then (2) fix the input page to keep all new submissions utf-
 8.
 
 91) For the first step, fixing the current database, I tried:
 
 update Articles set body = CONVERT(body USING ASCII);
 
 However, when I checked one of the articles I found an apostrophe had
 been converted into a question mark. (FWIW, the apostrophe was one of
 those offending non utf-8 characters):
 
 Before conversion: I stepped into the observatory?s control room ...
 
 After conversion: I stepped into the observatory?s control room...
 
 Is there a better way to accomplish my first goal, without reading each
 article and manually making the changes?
 
 I do not remember where on the MySQL website this is, but there was an
 article about converting from character sets in version 4 to those in
 version 5, when UTF-8 first was supported. It sounds to me that maybe
 the tricks shown there would be useful to you, since, in effect,
 through MySQL MySQL was fooled into accepting for UTF-8 that which was
 not. Conversion to binary string was mentioned.
 
 
 --
 MySQL General Mailing List
 For list archives: http://lists.mysql.com/mysql
 To unsubscribe:http://lists.mysql.com/mysql


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql



Re: Need Help Converting Character Sets

2012-09-27 Thread hsv
 2012/09/24 16:28 -0700, Mark Phillips 
I have a table, Articles, of news articles (in English) with three text
columns for the intro, body, and caption. The data came from a web page,
and the content was cut and pasted from other sources. I am finding that
there are some non utf-8 characters in these three text columns. I would
like to (1) convert these text fields to be strict utf-8 and then (2) fix
the input page to keep all new submissions utf-8.

91) For the first step, fixing the current database, I tried:

update Articles set body = CONVERT(body USING ASCII);

However, when I checked one of the articles I found an apostrophe had been
converted into a question mark. (FWIW, the apostrophe was one of those
offending non utf-8 characters):

Before conversion: I stepped into the observatory’s control room ...

After conversion: I stepped into the observatory?s control room...

Is there a better way to accomplish my first goal, without reading each
article and manually making the changes? 

I do not remember where on the MySQL website this is, but there was an article 
about converting from character sets in version 4 to those in version 5, when 
UTF-8 first was supported. It sounds to me that maybe the tricks shown there 
would be useful to you, since, in effect, through MySQL MySQL was fooled into 
accepting for UTF-8 that which was not. Conversion to binary string was 
mentioned.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql



Re: Need Help Converting Character Sets

2012-09-27 Thread Derek Downey
To go along with what Rick is saying, this link might help you: 
http://dba.stackexchange.com/questions/10467/how-to-convert-control-characters-in-mysql-from-latin1-to-utf-8

I remember doing a bunch of converting HEX() control characters (such as an 
apostrophe copied from a Word document) before attempting the SET NAMES.

Derek Downey


On Sep 24, 2012, at 7:53 PM, Rick James wrote:

 If you have a mixture of encodings, you are in deep doodoo.
 
 This page describes some debugging techniques and some issues:
   http://mysql.rjweb.org/doc.php/charcoll
 
 That apostrophe might be MicroSquish's smart quote.
 
 Can you provide SELECT HEX(the_field) FROM... ?  We (or the above page) might 
 be able to interpret the character.
 
 To prevent future char set issues, you must know what encoding the source is. 
  Then, with SET NAMES (etc), you tell mysqld that the bytes you have in hand 
 are encoded that way.  mysqld will then convert those bytes to the character 
 set of declared for the column they go in.  (Presumably, all the text columns 
 will be declared utf8 or utf8mb4.)
 
 -Original Message-
 From: Mark Phillips [mailto:m...@phillipsmarketing.biz]
 Sent: Monday, September 24, 2012 4:28 PM
 To: Mysql List
 Subject: Need Help Converting Character Sets
 
 I have a table, Articles, of news articles (in English) with three text
 columns for the intro, body, and caption. The data came from a web
 page, and the content was cut and pasted from other sources. I am
 finding that there are some non utf-8 characters in these three text
 columns. I would like to (1) convert these text fields to be strict
 utf-8 and then (2) fix the input page to keep all new submissions utf-
 8.
 
 91) For the first step, fixing the current database, I tried:
 
 update Articles set body = CONVERT(body USING ASCII);
 
 However, when I checked one of the articles I found an apostrophe had
 been converted into a question mark. (FWIW, the apostrophe was one of
 those offending non utf-8 characters):
 
 Before conversion: I stepped into the observatory's control room ...
 
 After conversion: I stepped into the observatory?s control room...
 
 Is there a better way to accomplish my first goal, without reading each
 article and manually making the changes?
 
 (2) For the second goal, insuring that all future articles are utf-8,
 do I need to change the table structure or the insert query to insure I
 get the correct utf-8 characters into the database?
 
 Thanks,
 
 Mark
 
 --
 MySQL General Mailing List
 For list archives: http://lists.mysql.com/mysql
 To unsubscribe:http://lists.mysql.com/mysql
 



RE: Need Help Converting Character Sets

2012-09-24 Thread Rick James
If you have a mixture of encodings, you are in deep doodoo.

This page describes some debugging techniques and some issues:
   http://mysql.rjweb.org/doc.php/charcoll

That apostrophe might be MicroSquish's smart quote.

Can you provide SELECT HEX(the_field) FROM... ?  We (or the above page) might 
be able to interpret the character.

To prevent future char set issues, you must know what encoding the source is.  
Then, with SET NAMES (etc), you tell mysqld that the bytes you have in hand are 
encoded that way.  mysqld will then convert those bytes to the character set of 
declared for the column they go in.  (Presumably, all the text columns will be 
declared utf8 or utf8mb4.)

 -Original Message-
 From: Mark Phillips [mailto:m...@phillipsmarketing.biz]
 Sent: Monday, September 24, 2012 4:28 PM
 To: Mysql List
 Subject: Need Help Converting Character Sets
 
 I have a table, Articles, of news articles (in English) with three text
 columns for the intro, body, and caption. The data came from a web
 page, and the content was cut and pasted from other sources. I am
 finding that there are some non utf-8 characters in these three text
 columns. I would like to (1) convert these text fields to be strict
 utf-8 and then (2) fix the input page to keep all new submissions utf-
 8.
 
 91) For the first step, fixing the current database, I tried:
 
 update Articles set body = CONVERT(body USING ASCII);
 
 However, when I checked one of the articles I found an apostrophe had
 been converted into a question mark. (FWIW, the apostrophe was one of
 those offending non utf-8 characters):
 
 Before conversion: I stepped into the observatory's control room ...
 
 After conversion: I stepped into the observatory?s control room...
 
 Is there a better way to accomplish my first goal, without reading each
 article and manually making the changes?
 
 (2) For the second goal, insuring that all future articles are utf-8,
 do I need to change the table structure or the insert query to insure I
 get the correct utf-8 characters into the database?
 
 Thanks,
 
 Mark

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql