Re: UTF-8 ... help

James Gardner Sat, 13 Jan 2007 18:14:22 -0800


Hi Matt,

I'm not trying to be pedantic here either, but I'm afraid I didn't want to leave your email as the last in the thread in case it confuses anyone following on.

> My understanding is that what we call "utf-8" is - *is* - *IS* ascii... the ascii representation of unicode, /encoded/ into ascii via the 'utf-8' encoding method. That's important to repeat: utf-8 IS NOT unicode. It's a way to STORE unicode in 8-bit bytes (strings).


Have a read of the documentation I wrote here:

http://pylonshq.com/docs/0.9.4.1/internationalization.html#what-is-unicode

My understanding is that unicode is made from code points in memory. To serialise unicode text for storage or display you need to encode it. One way to encode it is to use the UTF-8 encoding. UTF-8 doesn't encode every character in a single 8-bit byte; however, it does encode the characters that make up the ASCII character set in 8 bits each, so ASCII characters encoded as ASCII are the same as ASCII characters encoded in UTF-8. This means that non-unicode aware programs typically work OK as long as you use English characters. However, UTF-8 represents non-ASCII characters using multiple bytes, so those characters are stored very differently and can't be represented as ASCII. It is therefore totally wrong to say UTF-8 *is* the same as ASCII, even though for the ASCII characters the encoded versions are the same. Hope that's clearer.
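To make the point above concrete, here is a minimal sketch. It assumes Python 3, where `str` holds unicode code points and `bytes` holds encoded data (on the Python 2 of this thread the equivalent types are `unicode` and `str`). The sample strings are my own choice for illustration.

```python
# Python 3: str holds Unicode code points, bytes holds encoded data.
ascii_text = "hello"
non_ascii_text = "g\u00f6"  # "g" followed by o-umlaut (U+00F6)

# ASCII characters encode to exactly the same bytes under both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True

# A non-ASCII character takes more than one byte in UTF-8.
utf8_bytes = non_ascii_text.encode("utf-8")
print(utf8_bytes)                            # b'g\xc3\xb6'
print(len(non_ascii_text), len(utf8_bytes))  # 2 3

# The same text cannot be represented as ASCII at all.
try:
    non_ascii_text.encode("ascii")
    ascii_encodable = True
except UnicodeEncodeError:
    ascii_encodable = False
print(ascii_encodable)  # False
```

So the two encodings agree only on the ASCII subset; as soon as a character outside it appears, the byte sequences differ.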

> Anyway, anything that is 'utf-8' is just ascii, and it should make it through templates just fine.

Any characters in the ASCII character set encoded as UTF-8 are the same as ASCII characters encoded as ASCII and should make it through templates just fine, although it is better to have proper unicode support.


> If the template (or browser) attempts to decode it improperly, you get output like this: 'gö'. Usually trying to "fix" it is hopeless... it's been mistranslated somewhere up the toolchain, and one can't reverse-patch it to fix it (though it would be theoretically possible... as it's just look-up-tables).

Well, it isn't impossible. The trick is to decode from whatever the encoding of the submitted data is to unicode as soon as it enters your application. You then use unicode strings throughout your app and only encode to UTF-8 again right at the end, when the page is output to the browser.
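A rough sketch of that decode-early, encode-late pattern, again assuming Python 3. The helper names `decode_input` and `encode_output` are hypothetical and not part of Pylons or any library. The last part shows how output like 'gö' comes about when UTF-8 bytes are decoded with the wrong encoding (Latin-1 is assumed here), and that this particular mistake can be reversed.

```python
def decode_input(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode submitted bytes as they enter the app."""
    return raw.decode(encoding)


def encode_output(text: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: encode unicode text right before it is sent out."""
    return text.encode(encoding)


submitted = b"g\xc3\xb6"            # UTF-8 bytes as a browser might send them
text = decode_input(submitted)      # unicode from here on
page = encode_output(text.upper())  # work on unicode, encode only at the end
print(page)  # b'G\xc3\x96'

# Decoding the same bytes with the wrong encoding gives the garbled form.
garbled = submitted.decode("latin-1")
print(len(garbled))  # 3 characters instead of 2

# Because Latin-1 maps every byte to a character, this mix-up is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == text)  # True
```

The repair only works when you know which wrong encoding was applied and no bytes were lost along the way, which is why decoding correctly at the boundary is the safer habit.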


Again, it's all in the documentation.

To really confuse things you can also edit line 363 of your Python installation's site.py file to change the default encoding from ascii to "UTF-8" installation wide. Then you might find that as long as all your pages are UTF-8 your non-unicode adapted code works perfectly well, because the input from the browser will be UTF-8 and every time Python has to convert a byte string implicitly it will assume the text is UTF-8 rather than ASCII, which it probably is, and will probably produce the correct unicode strings. It is a nasty hack but it works rather well in some cases. If you try it, just bear in mind you aren't really solving the problem, just making it go away, and that the change affects all Python libraries you have installed, which might have unforeseen consequences!
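The effect of that hack can be imitated with a small simulation. Python 3 no longer has a changeable default encoding or implicit byte-string coercion, so the hypothetical `implicit_decode` helper below only stands in for what Python 2 does behind the scenes when it mixes byte strings with unicode; it is a sketch under that assumption, not the real mechanism.

```python
def implicit_decode(data: bytes, default_encoding: str) -> str:
    """Hypothetical stand-in for Python 2's implicit str -> unicode coercion."""
    return data.decode(default_encoding)


browser_input = b"g\xc3\xb6"  # UTF-8 bytes from a UTF-8 page

# With the stock "ascii" default, any non-ASCII byte raises an error.
try:
    implicit_decode(browser_input, "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)  # False

# With the default switched to UTF-8, the same input decodes cleanly.
utf8_result = implicit_decode(browser_input, "utf-8")
print(utf8_result == "g\u00f6")  # True

# The hack only helps while every input really is UTF-8.
latin1_input = "g\u00f6".encode("latin-1")  # b'g\xf6'
try:
    implicit_decode(latin1_input, "utf-8")
    latin1_ok = True
except UnicodeDecodeError:
    latin1_ok = False
print(latin1_ok)  # False
```

The last case is the catch: data in any other encoding still fails, only now the assumption is hidden in an installation-wide setting.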


HTH,

James
