Re: Need Help Converting Character Sets

2012-10-01 Thread hsv
 2012/09/30 11:07 -0700, Mark Phillips 
The data for this table comes from a web page (charet utf8). I copy/paste word 
files into gedit (on linux) and then copy/paste from gedit to a text boxes on 
the web page input form. I had thought I was stripping out all the funky 
characters by using a simple ascii editor like gedit, but obviously not.

After looking at the mysqldump for the table in a hex editor, I discovered I 
have these characters scatter throughout the body and intro columns:
“
” 
’
—
…
↩

How do you mean this? Is there an instance of Unicode character
LEFT DOUBLE QUOTATION MARK, or else of the string "“"?

In any case, this sounds like not an SQL, but general-programming problem; 
furthermore, I suggest that you carefully select a left double quotation mark 
or any of the other punctuation characters (RIGHT DOUBLE QUOTATION MARK, RIGHT 
SINGLE QUOTATION MARK, EM DASH, HORIZONTAL ELLIPSIS, ...) that, it seems, you 
are getting and not liking, and carefully follow your procedure.

It further seems to me that you really do not want that to be UTF-8 string, but 
ASCII, or Latin1.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql



RE: Need Help Converting Character Sets

2012-10-01 Thread Rick James
(Sorry; these are random notes, not well structured for your questions.)

&#...; -- This is an "html entity".  It is plain ascii, but _represents_ (in an 
HTML contect) some special character.  Some details:
http://htmltimes.com/html-special-characters.php
http://www.dwheeler.com/essays/quotes-in-html.html

That is, $#...; is not a character set issue in MySQL.  Most CHARACTER SETs 
(latin1, utf8, etc) will handle them the same.  And they will continue to 
render on an HTML page as the 'funky characters'.  That's HTML doing the deed 
in the browser.

The ALTERs did nothing because there was nothing to do.

I don't know what process converted from '"' to “.

The utf8 equivalent:
http://www.fileformat.info/info/unicode/char/201c/index.htm
It would be hex e2809c (3 bytes).

The "correct" way is to get the utf8 encoding from the , and store that 
in a utf8-declared column in the table.

If you already have utf8 bytes in your latin1 fields, it gets messier.

hex 22 is the 'straight' ascii double quote.  8220 and 8221 are the "curved" 
double quotes.  A common convention is to translate both curved ones to the 
single straight one.  But that would be work for you to explicitly do.

If you are using PHP, see functions htmlentities() and html_entity_decode().



> -Original Message-
> From: Mark Phillips [mailto:m...@phillipsmarketing.biz]
> Sent: Sunday, September 30, 2012 11:08 AM
> To: Rick James
> Cc: h...@tbbs.net; Mysql List
> Subject: Re: Need Help Converting Character Sets
> 
> Thanks to all of you for your very helpful suggestions! I was pulled
> off on a different project for a few days, and I am now just getting
> back to this one. Sorry for my absence.
> 
> First, some more information:
> 
> My table definition:
> CREATE TABLE `Articles` (
> `articleID` int(11) unsigned NOT NULL AUTO_INCREMENT, `title`
> varchar(200) NOT NULL DEFAULT '', `author` varchar(200) NOT NULL
> DEFAULT '', `body` text NOT NULL, `intro` text NOT NULL, `caption` text
> NOT NULL, `credits` text NOT NULL, `articleDate` date NOT NULL DEFAULT
> '-00-00', `imageTitle` varchar(255) NOT NULL DEFAULT '',
> `imageAltText` varchar(255) NOT NULL DEFAULT '', `imageWidth` float NOT
> NULL DEFAULT '1', `imageHeight` float NOT NULL DEFAULT '1', `imageFile`
> varchar(255) NOT NULL DEFAULT '', PRIMARY KEY (`articleID`), FULLTEXT
> KEY `search1` (`title`,`author`,`body`,`caption`,`credits`)
> ) ENGINE=MyISAM AUTO_INCREMENT=89 DEFAULT CHARSET=latin1
> 
> First problem - CHARSET should be utf8.
> 
> The data for this table comes from a web page (charet utf8). I
> copy/paste word files into gedit (on linux) and then copy/paste from
> gedit to a text boxes on the web page input form. I had thought I was
> stripping out all the funky characters by using a simple ascii editor
> like gedit, but obviously not.
> 
> After looking at the mysqldump for the table in a hex editor, I
> discovered I have these characters scatter throughout the body and
> intro columns:
> “
> ”
> ’
> —
> …
> ↩
> 
> I tried converting the columns into utf8 with the command ALTER TABLE
>  CONVERT TO CHARACTER SET utf8; but all the special
> characters are still there.
> 
> I tried converting to blob and back to utf8, and that didn't change
> anything. (I had to first drop the fulltext key to convert to blob).
> ALTER TABLE t1 CHANGE c1 c1 BLOB;
> ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8; The above special
> characters are still in the body column when I view the dump file with
> a hex editor.
> 
> Is there a way to replace these special characters with the appropriate
> utf8 characters (eg “ -> 22 hex) within the text columns body and
> intro columns?
> 
> Thanks,
> 
> Mark
> 
> On Fri, Sep 28, 2012 at 7:59 AM, Rick James 
> wrote:
> 
> > Thanks for that link!  That's another subtle issue I had not noted.
> >
> > There are so many combinations, that it is hard to say "do this":
> > * Incoming bytes are latin1 / utf8 / Microsquish control characters.
> > * You do/don't have SET NAMES (or equivalent)
> > * The database/table/column is declared latin1/utf8/other.
> > * The problem is on ingestion / on retrieval.
> >
> > The thing mentioned involved 2 steps:
> > ALTER TABLE ... MODIFY COLUMN  BINARY (or BLOB);  -- to forget any
> > charset knowledge ALTER TABLE ... MODIFY COLUMN  CHARACTER SET ...;
> > -- coming from BINARY, this does not check the encoding.
> > (sorry, don't have the link handy)
> >
> > > -Original Message-
> > > From: h...@tbbs.net [mailto:h...@tbbs.net]
> > > Sent: Thursday, September 27, 2012 2:24 PM
> > > To: Mark Phillips
> > > Cc: Mysql List
> > > Subject: Re: Need Help Converting Character Sets
> > >
> > >  2012/09/24 16:28 -0700, Mark Phillips 
> > > I have a table, Articles, of news articles (in English) with three
> > > text columns for the intro, body, and caption. The data came from a
> > > web page, and the content was cut and pasted from other sources. I
> > > am finding that there are some non utf-8 characters in 

RE: NOT_REGEXP Query Help

2012-10-01 Thread John Nichel
> -Original Message-
> From: Rik Wasmus [mailto:r...@grib.nl]
> Sent: Monday, October 01, 2012 10:44 AM
> To: mysql@lists.mysql.com
> Subject: Re: NOT_REGEXP Query Help
>
> >   I'm having an issue using regular expressions, and I was hoping
> > someone can see the syntax error that I'm missing (Monday morning,
> not
> > enough coffee, etc).  Running the following query returns the error:
> >
> > SELECT `a.custid`,`a.login` FROM ol.customers a WHERE `a.login`
> > NOT_REGEXP '^anonymous[[:digit:]]{3,}$';
> > Error:
> >
> > MySQL said
> > #1064 - You have an error in your SQL syntax; check the manual that
> > corresponds to your MySQL server version for the right syntax to use
> > near 'NOT_REGEXP '
>
> It's
> NOT REGEXP
> ... without the underscore...
> --
> Rik Wasmus
>

You know, I've been staring at this thing for the better part of an
hour...baffled as to why it isn't working.

New rule: THREE cups of coffee before I start working.

Thanks Rik!

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql



Re: NOT_REGEXP Query Help

2012-10-01 Thread Rik Wasmus
>   I'm having an issue using regular expressions, and I was hoping someone
> can see the syntax error that I'm missing (Monday morning, not enough
> coffee, etc).  Running the following query returns the error:
> 
> SELECT `a.custid`,`a.login` FROM ol.customers a WHERE `a.login` NOT_REGEXP
> '^anonymous[[:digit:]]{3,}$';
> Error:
> 
> MySQL said
> #1064 - You have an error in your SQL syntax; check the manual that
> corresponds to your MySQL server version for the right syntax to use near
> 'NOT_REGEXP '

It's 
NOT REGEXP
... without the underscore...
-- 
Rik Wasmus

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql



NOT_REGEXP Query Help

2012-10-01 Thread John Nichel
Hi all,

  I'm having an issue using regular expressions, and I was hoping someone
can see the syntax error that I'm missing (Monday morning, not enough
coffee, etc).  Running the following query returns the error:

SELECT `a.custid`,`a.login` FROM ol.customers a WHERE `a.login` NOT_REGEXP
'^anonymous[[:digit:]]{3,}$';

Error:

MySQL said
#1064 - You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near
'NOT_REGEXP '^anonymous[[:digit:]]{3,}$'
LIMIT 0, 30' at line 1

MySQL server version 4.1.22 (yeah, I know it's old :) )

Any help would be greatly appreciated.

--
John C. Nichel IV
System Administrator
KegWorks
http://www.kegworks.com
716.362.9212 x160
j...@kegworks.com

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/mysql