[PHP] utf8_decode() and mixed character sets

2009-10-10 Thread James Colannino
Hey everyone.  I'd been troubled for a while by the fact that inserting
cut-and-pasted special characters such as &auml; caused truncation when passed
to MySQL, then discovered that it was because I was cutting and pasting Unicode
values into non-Unicode Latin-1 strings.

Since Latin-1 also has equivalent values, I was hoping that filtering my mixed
Unicode/non-Unicode string through utf8_decode() would solve the problem.
Instead, where the Unicode character used to be, I now get a '?', and a few
characters after it are taken out of the middle of the string.  I'm guessing
that this is because utf8_decode() assumes the whole string is Unicode and
therefore removes a bunch of extra bytes from the string and corrupts it.  At
least, that's my guess.  I could be very wrong (I have pretty much no
experience with different character sets).

My question is, what's a good way to translate Unicode characters in a
non-Unicode string to their Latin-1 equivalents?  I need to be able to do this
in order to sanitize a fairly common form of input.

Thanks!

James

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] utf8_decode() and mixed character sets

2009-10-10 Thread Andrew Ballard
On Sat, Oct 10, 2009 at 11:40 PM, James Colannino <ja...@colannino.org> wrote:

> Hey everyone.  I'd been troubled for a while by the fact that inserting
> cut-pasted special characters such as &auml; caused truncation when passed to
> MySQL, then discovered that it was because I was cutting and pasting unicode
> values into non-unicode Latin-1 strings.
>
> Since Latin-1 also has equivalent values, I was hoping that filtering my mixed
> unicode/non-unicode string through utf8_decode() would solve the problem, but
> instead, where the unicode character used to be, I now get a '?', followed by
> a few characters being taken out of the middle.  I'm guessing that this is
> because utf8_decode() assumes the whole string is unicode and therefore
> removes a bunch of extra bytes from the string and corrupts it.  At least,
> that's my guess.  I could be very wrong (I have pretty much no experience
> with different character sets...)
>
> My question is, what's a good way to translate unicode characters in a
> non-unicode string to their Latin-1 equivalents?  I need to be able to do this
> in order to sanitize a fairly common form of input.
>
> Thanks!
>
> James


Have you tried iconv or mbstring? Is it an option to update the
database to use UTF-8?
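
As a rough illustration of the iconv route, here is a minimal sketch that converts only the UTF-8 multi-byte sequences in a mixed string and leaves bytes that are already Latin-1 alone. The function name mixed_to_latin1() is hypothetical, the byte pattern is a loose approximation of well-formed UTF-8, and a Latin-1 byte pair that happens to look like a UTF-8 sequence (for example 0xC3 0xA4) would be misread, so treat it as a starting point rather than a complete sanitizer.

```php
<?php
// Hypothetical helper, not part of PHP or of the original thread.
// Converts UTF-8 multi-byte sequences to Latin-1 and leaves every other
// byte (ASCII and stray single-byte Latin-1 characters) untouched.
function mixed_to_latin1(string $input): string
{
    // Loose byte-level match for 2-, 3- and 4-byte UTF-8 sequences.
    // No /u modifier: the subject is deliberately treated as raw bytes.
    $pattern = '/[\xC2-\xDF][\x80-\xBF]'
             . '|[\xE0-\xEF][\x80-\xBF]{2}'
             . '|[\xF0-\xF4][\x80-\xBF]{3}/';

    return preg_replace_callback($pattern, function (array $m): string {
        // iconv() returns false when the character has no Latin-1
        // equivalent (or the sequence is malformed); fall back to '?'.
        $converted = @iconv('UTF-8', 'ISO-8859-1', $m[0]);
        return $converted === false ? '?' : $converted;
    }, $input);
}
```

Calling mixed_to_latin1() on a string that contains both a UTF-8 "ä" (bytes 0xC3 0xA4) and a Latin-1 "ä" (byte 0xE4) should yield the single byte 0xE4 in both places, which is the behavior utf8_decode() cannot provide when applied to the whole string.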

Andrew




Re: [PHP] utf8_decode() and mixed character sets

2009-10-10 Thread James Colannino
Andrew Ballard wrote:

> Have you tried iconv or mbstring? Is it an option to update the
> database to use UTF-8?

I'll look into those functions.  And, I suppose I could in fact convert my
database to use UTF-8 if necessary.
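
If the database does move to UTF-8, the same idea can be run in the other direction: keep what already looks like UTF-8 and upgrade stray Latin-1 bytes. The sketch below is only an assumption about how that might look. mixed_to_utf8() is a hypothetical name, and the pattern is again a loose byte-level check rather than a strict UTF-8 validator.

```php
<?php
// Hypothetical helper, not part of PHP or of the original thread.
// Normalizes a mixed string to UTF-8: sequences that already look like
// UTF-8 are kept, any other high byte is treated as Latin-1 and converted.
function mixed_to_utf8(string $input): string
{
    // Group 1: loose match for multi-byte UTF-8 sequences.
    // Alternative without a group: any remaining single byte >= 0x80.
    $pattern = '/([\xC2-\xDF][\x80-\xBF]'
             . '|[\xE0-\xEF][\x80-\xBF]{2}'
             . '|[\xF0-\xF4][\x80-\xBF]{3})'
             . '|[\x80-\xFF]/';

    return preg_replace_callback($pattern, function (array $m): string {
        if (isset($m[1]) && $m[1] !== '') {
            return $m[0]; // already UTF-8, keep as-is
        }
        // Every Latin-1 byte has a UTF-8 equivalent, so this cannot fail
        // for a single byte in the 0x80-0xFF range.
        return iconv('ISO-8859-1', 'UTF-8', $m[0]);
    }, $input);
}
```

With this approach the input is cleaned once on the way in, and the connection and table character sets would also need to be UTF-8 for the stored value to round-trip.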

James
