hubbs wrote on 5/22/2008 2:19 PM:
I would love to move to UTF8.  Problem is, 99% of our site was created
with ISO-8859-1, so when I changed it to UTF8, all " and ' characters
got question marks.  And I am not about to go through the entire site
and retype those.  Not sure why that happens though.

Those curly quotes and other chars are saved in your web pages using the 
Windows-1252 character set, so they're served as Windows-1252 bytes.  The 
problem comes in when you declare UTF-8 to the browser but are still sending 
Windows-1252 bytes: those bytes aren't valid UTF-8, so the browser shows a 
replacement character (the question mark) instead.
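
One way to avoid retyping anything is to re-encode the stored bytes rather 
than the characters.  A minimal Python sketch, assuming the pages really are 
Windows-1252 on disk; the function name `reencode_to_utf8` and the sample 
markup are made up for illustration:

```python
def reencode_to_utf8(raw: bytes) -> bytes:
    """Reinterpret Windows-1252 bytes as text, then write that text as UTF-8."""
    # Assumption: the input is Windows-1252. Bytes that cp1252 leaves
    # undefined (such as 0x81) would raise UnicodeDecodeError here.
    return raw.decode("cp1252").encode("utf-8")


# Hypothetical page fragment with curly quotes stored as bytes 0x93 and 0x94.
old_page = b"<p>\x93Hello\x94</p>"
new_page = reencode_to_utf8(old_page)
print(new_page.decode("utf-8"))
```

Run over each file (after taking a backup), something like this should let 
the pages be declared as UTF-8 without touching the text by hand.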

I've created a simple demo to illustrate the issue:

        <http://corry.biz/charset_mismatch.lasso>

There are two byte streams on the page; they both represent the exact same 
characters, just one is in UTF-8 and the other is in Windows-1252.  The 
characters are the MS extended chars that you tend to see come out of MS Word.  
So for example, the trademark sign is the single byte 153 (0x99) in 
Windows-1252 and the three bytes 0xE2 0x84 0xA2 (14845090 read as one 
integer) in UTF-8.
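
Those two numbers can be checked with a few lines of Python.  This is only a 
sketch for verification, not something taken from the demo page; the variable 
names are arbitrary:

```python
TRADEMARK = "\u2122"  # the trademark sign

cp1252_bytes = TRADEMARK.encode("cp1252")  # one byte
utf8_bytes = TRADEMARK.encode("utf-8")     # three bytes

# Read each byte sequence as a single big-endian integer.
cp1252_value = int.from_bytes(cp1252_bytes, "big")
utf8_value = int.from_bytes(utf8_bytes, "big")

print(cp1252_bytes.hex(), cp1252_value)
print(utf8_bytes.hex(), utf8_value)
```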

The page is served as UTF-8 by default (via the content-type header), and you 
can see that the UTF-8 byte stream renders properly, but the Windows-1252 byte 
stream doesn't.  If you change the page charset to Windows-1252, then the 
Windows-1252 byte stream renders properly, but the UTF-8 byte stream doesn't.  
This is all as you would expect.
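
The same four combinations can be reproduced outside a browser.  The sketch 
below is an approximation: `render` is a hypothetical stand-in for what a 
browser does with a declared charset, and the sample text is invented.

```python
text = "\u2018hi\u2019 \u2122"  # curly single quotes and a trademark sign

cp1252_stream = text.encode("cp1252")
utf8_stream = text.encode("utf-8")


def render(stream: bytes, declared_charset: str) -> str:
    """Decode bytes the way the declared charset says to, never raising."""
    return stream.decode(declared_charset, errors="replace")


utf8_as_utf8 = render(utf8_stream, "utf-8")        # matches the text
cp1252_as_utf8 = render(cp1252_stream, "utf-8")    # replacement characters
cp1252_as_cp1252 = render(cp1252_stream, "cp1252")  # matches the text
utf8_as_cp1252 = render(utf8_stream, "cp1252")     # multi-character garbage

for label, result in [
    ("UTF-8 stream, UTF-8 declared", utf8_as_utf8),
    ("Windows-1252 stream, UTF-8 declared", cp1252_as_utf8),
    ("Windows-1252 stream, Windows-1252 declared", cp1252_as_cp1252),
    ("UTF-8 stream, Windows-1252 declared", utf8_as_cp1252),
]:
    print(f"{label}: {result!r}")
```

The second case is the one that produces question marks on a site that was 
switched to UTF-8 without converting the files.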

Now it's interesting to see that for ISO-8859-1, which technically doesn't have 
any of those characters, both FF2 and IE7 render the Windows-1252 byte stream 
properly.  What's happening is the browser is helping us out by rendering it as 
Windows-1252, even though the charset is declared as ISO-8859-1 (a mismatch).
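
To make the mismatch concrete: a strict ISO-8859-1 decoder turns byte 0x99 
into an invisible control character, while Windows-1252 turns it into the 
trademark sign.  In the Python sketch below, `BROWSER_ALIASES` and 
`browser_decode` are hypothetical names that only imitate the lenient 
behaviour described above; they are not how any browser is actually 
implemented.

```python
raw = b"\x99"  # the trademark sign as saved by a Windows-1252 editor

strict_latin1 = raw.decode("iso-8859-1")  # U+0099, a C1 control character
lenient = raw.decode("cp1252")            # U+2122, the trademark sign

# Hypothetical label table imitating the lenient browser behaviour.
BROWSER_ALIASES = {"iso-8859-1": "cp1252"}


def browser_decode(data: bytes, declared: str) -> str:
    """Decode using the declared label, after applying the alias table."""
    codec = BROWSER_ALIASES.get(declared.lower(), declared)
    return data.decode(codec, errors="replace")


print(repr(strict_latin1), repr(lenient))
print(repr(browser_decode(raw, "ISO-8859-1")))
```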

And if you look at the MacRoman encoding, FF2 doesn't render either byte 
stream properly.  IE7 doesn't support MacRoman, so it falls back to the 
Windows-1252 character set instead, and the Windows-1252 byte stream then 
renders properly.
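
The MacRoman result can be checked the same way.  A small Python sketch, 
using the trademark sign as the only sample character (an arbitrary choice):

```python
TRADEMARK = "\u2122"

utf8_stream = TRADEMARK.encode("utf-8")
cp1252_stream = TRADEMARK.encode("cp1252")

# Decode both byte streams as MacRoman, as a browser honouring that
# declaration would.
from_utf8 = utf8_stream.decode("mac_roman")
from_cp1252 = cp1252_stream.decode("mac_roman")

print(repr(from_utf8), repr(from_cp1252))
```

Neither result is the trademark sign, which matches neither stream rendering 
properly when MacRoman is honoured.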

Safari behaved identically to Firefox, and I didn't try Opera.

- Bil
