...
>> IOW, if you're producing output that has to go into another system
>> that doesn't take unicode, it doesn't matter how
>> theoretically-correct it would be for your app to process the data in
>> unicode form. In that case, unicode is not a feature: it's a bug.
>>
> This is not always true. Suppose you read a webpage, chop it up so you
> get a list of words, create a histogram of word lengths, and then write
> the output as utf8 to a database. Should you do all your intermediate
> string operations on utf8-encoded byte strings? No, you should do them
> on unicode strings, as otherwise you need to know the details of how
> utf8 encodes characters.
>
You'd still have problems in Unicode, given that å compares unequal to å
when one is u'\xe5' and the other is u'a\u030a'. (Whether they look the
same depends on your Unicode rendering: IDLE shows them pretty much
identically, while T-Bird on Windows with my current font shows the second
as two characters.)

I realize this was a toy example, but it does point out that Unicode
complicates the idea of 'equality' as well as the idea of 'what is a
character'. Just saying "decode it to Unicode" isn't really sufficient.

John =:->

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
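A small sketch of the equality problem described above, for the archive. This is an illustration I am adding, not code from the thread; it uses Python 3 string literals rather than the u'' literals quoted in the message, and the standard-library `unicodedata.normalize` function to reconcile the two representations of å:

```python
import unicodedata

# Two visually identical strings: the precomposed character U+00E5
# versus 'a' followed by U+030A COMBINING RING ABOVE.
precomposed = "\u00e5"   # 'å' as a single code point
decomposed = "a\u030a"   # 'a' + combining ring above

# Code-point-wise they are different strings of different lengths.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# Normalizing both to the same form (NFC here, NFD also works) makes
# equality behave the way a human reader would expect.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                     # True
```

So "decode it to Unicode" is indeed only half the story: comparisons and length counts may also need an explicit normalization step.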