On Thu, 5 Jul 2007, Kent Johnson wrote:

>> First, don't confuse unicode and utf-8.
> > Too late ;-) already pitifully confused.
> This is a good place to start correcting that:
> http://www.joelonsoftware.com/articles/Unicode.html

Thanks for this, it's just what I needed!

> if s is your utf-8 string, instead of s.title(), use
> s.decode('utf-8').title().encode('utf-8')

Eureka! I was trying to make the round-trip _way_ too complicated.

>> to identify and process the place name data. If I translate the line to
>> unicode, the re fails.
>
> I don't know why that is, re works with unicode strings:
> In [1]: import re
> In [2]: re.match(r'[A-Z]{2,}', 'ABC')
> Out[2]: <_sre.SRE_Match object at 0x12078e0>
> In [3]: re.match(r'[A-Z]{2,}', u'ABC')
> Out[3]: <_sre.SRE_Match object at 0x11c1f00>

Of course. I was misinterpreting why things were failing. It wasn't the
regex, it was the decode() encode() round-trip. (a powerful argument for
getting familiar with try/except error handling!)

Again, many thanks for the education!

Jon
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
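
For anyone reading this in the archive, here is a minimal sketch of the round-trip Kent suggested, wrapped in the try/except handling mentioned above. The helper name `title_utf8` and the sample place name are made up for illustration and are not from the thread. It assumes the input is a byte string that is supposed to be UTF-8 (a plain `str` on Python 2, a `bytes` object on Python 3), and returning the input unchanged on failure is just one possible choice.

```python
# -*- coding: utf-8 -*-
import re


def title_utf8(raw):
    """Title-case UTF-8 encoded bytes via a decode/encode round-trip.

    Hypothetical helper for illustration only.
    """
    try:
        text = raw.decode('utf-8')        # bytes -> unicode text
    except UnicodeDecodeError:
        # The input was not valid UTF-8; hand it back unchanged so the
        # caller can decide what to do with it.
        return raw
    return text.title().encode('utf-8')   # unicode text -> bytes again


# Made-up sample data: a place name stored as UTF-8 bytes.
place = u'são paulo'.encode('utf-8')
print(title_utf8(place) == u'São Paulo'.encode('utf-8'))

# Latin-1 bytes are not valid UTF-8 here, so decode() raises and is caught.
bad = u'são paulo'.encode('latin-1')
print(title_utf8(bad) == bad)

# The regex from the thread matches decoded (unicode) text as well.
print(re.match(r'[A-Z]{2,}', u'ABC') is not None)
```

Catching `UnicodeDecodeError` at the decode step makes it clear when the round-trip is what fails, rather than the regex that runs afterwards.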