After considerable investigation into form input of non-Latin-1 characters that PHP then processes on a Linux box, I've been able to narrow the issue down considerably, though a solution (and one oddity) still eludes me.

I found a very helpful web page entitled "On the use of some MS Windows characters in HTML" that explains my problem rather well at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html. Recommended reading for anyone displaying text that may have been entered by Windows users, especially text pasted in from word-processing apps.

Basically, the problem is this: on a Windows machine using Windows-1252 ("Windows Latin 1"), a pair of smart quotes is stored as byte values 147 and 148. These are not ASCII, which stops at 127. Windows maps a number of other "special" characters onto byte values 128-159 as well, such as em dashes and trademark symbols.
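
To make the range concrete, here is a small sketch listing a few of the Windows-1252 assignments in the 128-159 block. The array name $cp1252Specials is made up for illustration, the list is only a subset, and the short array syntax assumes a reasonably current PHP.

```php
<?php
// Illustrative subset of the Windows-1252 assignments for bytes 128-159.
// $cp1252Specials is a hypothetical name used only for this example.
$cp1252Specials = [
    133 => 0x2026, // horizontal ellipsis
    145 => 0x2018, // left single quotation mark
    146 => 0x2019, // right single quotation mark
    147 => 0x201C, // left double quotation mark
    148 => 0x201D, // right double quotation mark
    150 => 0x2013, // en dash
    151 => 0x2014, // em dash
    153 => 0x2122, // trade mark sign
];

// Print each byte value next to the Unicode code point it stands for.
foreach ($cp1252Specials as $byte => $codePoint) {
    printf("byte %d (0x%02X) -> U+%04X\n", $byte, $byte, $codePoint);
}
```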

Unfortunately, _true_ Latin 1 (ISO-8859-1) reserves byte values 128-159 for control characters. So, while you may type ALT-0147 to enter a smart quote in your word-processing app (or let Word create one automagically when you type a quote), when that very same character is pasted into a web form set to accept ISO-8859-1 or UTF-8, it DOES NOT arrive as chr(147) when processed by PHP on a Linux box.
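
One way to check whether submitted text contains bytes from that reserved block is sketched below. The function name hasC1RangeBytes is hypothetical, and the check only means something if the input really is single-byte text (Latin 1 or Windows-1252); UTF-8 input also uses bytes in this range as parts of multi-byte sequences, so it would be flagged too.

```php
<?php
// Hypothetical helper: reports whether a byte string contains any byte in
// the 128-159 range, which ISO-8859-1 reserves for control characters.
function hasC1RangeBytes(string $bytes): bool
{
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

// Plain ASCII text contains no bytes above 127.
var_dump(hasC1RangeBytes("plain ASCII"));

// Windows-1252 curly quotes are bytes 147 (0x93) and 148 (0x94).
var_dump(hasC1RangeBytes("say \x93hi\x94"));
```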

Strangely, pasting in a Word-created smart quote character into a web form and processing it with PHP produces VERY ODD results. Take the string

=“=

where the quotation mark is a curly-style quote. Tell PHP to step through the string and print the numeric value of each byte. The two equals signs are fine (61 each), but the curly quote comes across as THREE bytes: (226)(128)(156). Where this comes from, I do not understand.
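
One reading that fits these numbers: 226, 128, 156 are 0xE2, 0x80, 0x9C in hex, which is the three-byte UTF-8 encoding of U+201C, the left double quotation mark. The sketch below reproduces the byte-by-byte walk and then decodes the middle three bytes by hand. It assumes the form data reached PHP as UTF-8; the variable names are made up for illustration.

```php
<?php
// The test string, written as explicit bytes so the encoding of this
// source file does not matter: '=', three bytes, '='.
$input = "=\xE2\x80\x9C=";

// Step through the string one byte at a time and collect each value.
$values = [];
for ($i = 0; $i < strlen($input); $i++) {
    $values[] = ord($input[$i]);
}
echo implode(' ', $values), "\n"; // 61 226 128 156 61

// Decode the three middle bytes as one 3-byte UTF-8 sequence
// (bit pattern 1110xxxx 10xxxxxx 10xxxxxx).
$codePoint = (($values[1] & 0x0F) << 12)
           | (($values[2] & 0x3F) << 6)
           |  ($values[3] & 0x3F);
printf("U+%04X\n", $codePoint); // U+201C
```

Note that strlen() and ord() work on bytes, not characters, which is why a single pasted character can show up as three values.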

I'm inclined to think that if I _don't_ specify the accept-charset attribute on the form, and _don't_ try to convert em dashes, curly quotes, etc., I'll probably end up with cleaner text than I have now.

Still, if anyone has any really helpful input on this topic, please write me and let me know. We're getting into the ugly guts of page charset vs. form accept-charset vs. browser input charset vs. Latin 1 vs. Windows Latin 1 vs. MacRoman here, but I'm surprised that no one has chimed in on this. Does anyone else ever run into this problem, or do everyone else's forms just handle all of this magically without any intervention?

spud.

-------------------------------------------------------------------
a.h.s. boy
spud(at)nothingness.org "as yes is to if,love is to yes"
http://www.nothingness.org/
-------------------------------------------------------------------