Hi Matt,

I'm not trying to be pedantic here either, but I didn't want to leave your email as the last one in the thread in case it confuses anyone following along.

> My understanding is that what we call "utf-8" is - *is* - *IS* ascii... the ascii representation of unicode, /encoded/ into ascii via the 'utf-8' encoding method. That's important to repeat: utf-8 IS NOT unicode. It's a way to STORE unicode in 8-bit bytes (strings).

Have a read of the documentation I wrote here:

http://pylonshq.com/docs/0.9.4.1/internationalization.html#what-is-unicode

My understanding is that unicode text is a sequence of code points in memory. To serialise unicode text for storage or display you need to encode it, and one way to do that is with the UTF-8 encoding. UTF-8 doesn't encode every character as a single 8-bit byte. It does, however, encode each character in the ASCII character set as one byte with the same value it has in ASCII, so ASCII characters encoded as ASCII are byte-for-byte the same as ASCII characters encoded as UTF-8. This means that programs which aren't unicode-aware typically work OK as long as you stick to plain English characters.

Non-ASCII characters are a different matter: UTF-8 represents them using multiple bytes, so they are stored very differently and can't be represented as ASCII at all. It is therefore wrong to say UTF-8 *is* ASCII, even though the encoded versions are identical for the ASCII characters. Hope that's clearer.
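
To make that concrete, here is a minimal sketch. It assumes Python 3, where str holds code points and bytes holds encoded data; the variable names are just for illustration.

```python
# Python 3 sketch: str is unicode (code points), bytes is encoded data.
ascii_text = "hello"
non_ascii_text = "g\u00f6"  # the two characters "g" and "ö"

# ASCII characters encode to exactly the same bytes either way.
ascii_as_ascii = ascii_text.encode("ascii")
ascii_as_utf8 = ascii_text.encode("utf-8")
same_for_ascii = ascii_as_ascii == ascii_as_utf8

# "ö" needs two bytes in UTF-8, so two characters become three bytes.
utf8_bytes = non_ascii_text.encode("utf-8")

# The same text cannot be encoded as ASCII at all.
try:
    non_ascii_text.encode("ascii")
    ascii_can_store_it = True
except UnicodeEncodeError:
    ascii_can_store_it = False

print(same_for_ascii, utf8_bytes, ascii_can_store_it)
```

The first comparison holds only because "hello" is pure ASCII; the last check is the reason UTF-8 and ASCII can't be treated as the same thing.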

> Anyway, anything that is 'utf-8' is just ascii, and it should make it through templates just fine.

Any characters in the ASCII character set encoded as UTF-8 are the same as ASCII characters encoded as ASCII and should make it through templates just fine, although it is better to have proper unicode support.

> If the template (or browser) attempts to decode it improperly, you get output like this: 'gö'. Usually trying to "fix" it is hopeless... it's been mistranslated somewhere up the toolchain, and one can't reverse-patch it to fix it (though it would be theoretically possible... as it's just look-up-tables).

Well, it isn't hopeless. The trick is to decode from whatever encoding the submitted data is in to unicode as soon as it enters your application. You then use unicode strings throughout your app and only encode to UTF-8 again right at the end, when the page is sent to the browser.
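
A rough sketch of that pattern, assuming Python 3. The handle_request function and its request_encoding parameter are hypothetical names for illustration, not part of Pylons. The last few lines reproduce the garbled output by decoding UTF-8 bytes as Latin-1, which is one assumption about how such a mix-up can happen, and then undo it.

```python
def handle_request(raw_body: bytes, request_encoding: str = "utf-8") -> bytes:
    """Hypothetical handler: decode at the edge, encode on the way out."""
    # 1. Decode to unicode as soon as the data enters the application.
    text = raw_body.decode(request_encoding)
    # 2. Work only with unicode strings inside the application.
    page = "You said: " + text
    # 3. Encode to UTF-8 right at the end, when the page is sent out.
    return page.encode("utf-8")


submitted = "g\u00f6".encode("utf-8")  # what the browser sends for "gö"
response = handle_request(submitted)

# Decoding the UTF-8 bytes with the wrong codec gives the familiar garbage.
garbled = submitted.decode("latin-1")

# If you know which wrong codec was used, the mistake can be reversed.
repaired = garbled.encode("latin-1").decode("utf-8")

print(response, garbled, repaired)
```

The repair step only works when the wrong codec is known and loses no bytes, which is why decoding correctly at the point of entry is the more dependable approach.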

Again, it's all in the documentation.

To really confuse things, you can also edit line 363 of your Python installation's site.py file to change the default encoding from "ascii" to "UTF-8" installation-wide. You might then find that, as long as all your pages are UTF-8, your non-unicode-adapted code works perfectly well: the input from the browser will be UTF-8, and whenever Python has to convert a byte string implicitly it will assume the text is UTF-8 rather than ASCII. That assumption will probably be right, so Python will probably produce the correct unicode strings. It is a nasty hack, but it works rather well in some cases. If you try it, bear in mind that you aren't really solving the problem, just making it go away, and that the change affects every Python library you have installed, which might have unforeseen consequences!
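
Here is a small sketch of why the hack only holds up while every input really is UTF-8. It assumes Python 3, which never decodes implicitly, so the hypothetical implicit_decode helper stands in for the conversion that the site.py default encoding controls.

```python
def implicit_decode(raw: bytes, default_encoding: str) -> str:
    """Hypothetical stand-in for an implicit byte-string-to-unicode conversion."""
    return raw.decode(default_encoding)


def works(raw: bytes, default_encoding: str) -> bool:
    """Report whether the conversion succeeds with the given default."""
    try:
        implicit_decode(raw, default_encoding)
        return True
    except UnicodeDecodeError:
        return False


utf8_input = "g\u00f6".encode("utf-8")      # b'g\xc3\xb6'
latin1_input = "g\u00f6".encode("latin-1")  # b'g\xf6'

results = {
    "utf8 input, ascii default": works(utf8_input, "ascii"),
    "utf8 input, utf-8 default": works(utf8_input, "utf-8"),
    "latin-1 input, utf-8 default": works(latin1_input, "utf-8"),
}
print(results)
```

Only the middle case succeeds: the changed default helps with UTF-8 input, but anything in another encoding still fails.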

HTH,

James
