After considerable investigation into form input of non-Latin-1
characters processed by PHP on a Linux box, I've been able to narrow
the issue down considerably, though a solution still eludes me and one
oddity remains confusing.

I found a very helpful web page entitled "On the use of some MS Windows
characters in HTML" that explains my problem rather well at
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html. Recommended
reading for anyone displaying text that may have been entered by
Windows users, especially text pasted in from word-processing apps.

Basically, the problem is this: on a Windows machine using Windows-1252
("Windows Latin 1"), a pair of smart quotes is stored as byte values 147
and 148. There are a number of other "special" characters that Windows
maps onto 128-159, like em dashes and trademark symbols. (Strictly
speaking, none of these are ASCII, which stops at 127.) Unfortunately,
_true_ Latin 1 (iso-8859-1) reserves 128-159 for control characters.
So, while you may type ALT-0147 to enter a smart quote in your
word-processing app (or let Word create one automagically when you type
a quote), when that very same character is pasted into a web page form
set to accept iso-8859-1 or UTF-8 encoding, it DOES NOT arrive as
chr(147) when processed by PHP on a Linux box.
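
To make the 128-159 range concrete, here is a minimal sketch of a
cleaner, assuming the input really does arrive as raw Windows-1252
bytes. The function name clean_cp1252() and the ASCII stand-ins are my
own assumptions, not anything PHP provides, and only a handful of the
affected byte values are covered.

```php
<?php
// Hypothetical helper: replace a few Windows-1252 bytes in the
// 128-159 range with plain ASCII stand-ins. Only meaningful if $text
// is single-byte Windows-1252 data, not UTF-8.
function clean_cp1252($text)
{
    $map = array(
        chr(133) => '...',  // horizontal ellipsis
        chr(145) => "'",    // left single quote
        chr(146) => "'",    // right single quote
        chr(147) => '"',    // left double quote
        chr(148) => '"',    // right double quote
        chr(150) => '-',    // en dash
        chr(151) => '--',   // em dash
        chr(153) => '(TM)', // trademark sign
    );
    // strtr() with an array replaces every key with its value.
    return strtr($text, $map);
}

// Two equal signs around a pair of Windows-1252 smart quotes.
echo clean_cp1252('=' . chr(147) . '=' . chr(148) . '='), "\n";
```

The lookup does nothing useful if the browser sends some other
encoding, which is exactly the situation described next.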

Strangely, pasting a Word-created smart quote character into a web
form and processing it with PHP produces VERY ODD results. Take the
string
="=
where the quotation mark is a curly-style quote. Tell PHP to step
through the string and print the numeric value of each character. The
two equal signs are fine (char 61), but the curly quote comes across as
THREE characters: (226)(128)(156). Where this comes from, I do not
understand.
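
For what it's worth, those three values match the UTF-8 encoding of the
left curly double quote (U+201C, bytes E2 80 9C in hex), which suggests
the browser submitted the form as UTF-8. The sketch below reproduces
the byte-by-byte dump. byte_values() is a hypothetical helper, and the
test string is written with hex escapes so that it does not depend on
the encoding of the script file.

```php
<?php
// Hypothetical helper: list the numeric value of every byte in $text.
// strlen() and ord() count bytes, not characters, which is why one
// multi-byte character shows up as several values.
function byte_values($text)
{
    $values = array();
    for ($i = 0; $i < strlen($text); $i++) {
        $values[] = ord($text[$i]);
    }
    return $values;
}

// "=", then the UTF-8 bytes for U+201C, then "="
$input = "=\xE2\x80\x9C=";
echo implode(' ', byte_values($input)), "\n";
// prints: 61 226 128 156 61
```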

I'm inclined to think that if I _don't_ specify the accept-charset
attribute on the form, and _don't_ try to convert em dashes, curly
quotes, etc., I'll probably end up with cleaner text than I have now.
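
If conversion does turn out to be necessary, a lookup could target the
multi-byte sequences rather than the single bytes 128-159. This is only
a sketch under the assumption that the form data arrives as UTF-8.
clean_utf8_punctuation() is a made-up name, and the list of sequences
is far from complete.

```php
<?php
// Hypothetical helper: replace a few UTF-8 encoded punctuation marks
// with plain ASCII stand-ins. Assumes $text is UTF-8.
function clean_utf8_punctuation($text)
{
    $map = array(
        "\xE2\x80\x98" => "'",    // U+2018 left single quote
        "\xE2\x80\x99" => "'",    // U+2019 right single quote
        "\xE2\x80\x9C" => '"',    // U+201C left double quote
        "\xE2\x80\x9D" => '"',    // U+201D right double quote
        "\xE2\x80\x93" => '-',    // U+2013 en dash
        "\xE2\x80\x94" => '--',   // U+2014 em dash
        "\xE2\x80\xA6" => '...',  // U+2026 horizontal ellipsis
        "\xE2\x84\xA2" => '(TM)', // U+2122 trademark sign
    );
    // strtr() matches whole byte sequences, longest key first.
    return strtr($text, $map);
}

// The three-byte curly quote collapses to a single straight quote.
echo clean_utf8_punctuation("=\xE2\x80\x9C="), "\n";
```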

Still, if anyone has any really helpful input on this topic, please
write me and let me know. We're getting into the ugly guts of page
charset vs. form accept-charset vs. browser input charset vs. Latin 1
vs. Windows Latin 1 vs. MacRoman here, but I'm surprised that no one
has chimed in on this. Does anyone else ever run into this problem, or
do everyone else's forms just handle all of this magically without
any intervention?

spud.
-------------------------------------------------------------------
a.h.s. boy
spud(at)nothingness.org "as yes is to if,love is to yes"
http://www.nothingness.org/
-------------------------------------------------------------------
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php