Re: [Tutor] more encoding confusion

Jon Crump Sun, 05 Aug 2007 12:38:08 -0700

On Sun, 5 Aug 2007, Kent Johnson wrote:

Hmm...actually, isupper() works fine on unicode strings:
In [18]: s='H\303\211RON'.decode('utf-8')
In [21]: print 'H\303\211RON'
HÉRON
In [22]: s.isupper()
Out[22]: True
:-)
I modified uppers to include only the latin characters, and added the apostrophe to catch placenames like L'ISLE.
Then you are back to needing a regular expression I think.

Ah! I'm finally starting to get it. My problem wasn't with a regex to test the line, my problem was with reading the file in as utf-8 to begin with. When I take your advice and decode() right from the start using:


data = open('textfile').read().decode('utf-8').splitlines()

instead of

input = open('textfile', 'r')
text = input.readlines()

Then the regex problem does not even arise. Now I can use this instead:

for line in data:
    if line[0:2].isupper():

As you point out, isupper() works just fine on unicode strings; it also seems to treat the apostrophe as uppercase, because this catches not only HÉRON but L'ISLE as well.
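
A minimal sketch of why the apostrophe gets through, assuming Python 3 and a few made-up sample lines (the names raw and headings and the sample text are illustrative, not taken from the real file). As far as I can tell, isupper() only looks at cased characters: it is true when there is at least one cased character and all of them are uppercase, so the apostrophe is simply ignored rather than counted as uppercase.

```python
# Hypothetical sample input: UTF-8 encoded bytes, as they would come from a file.
raw = b"H\xc3\x89RON, a heading\nL'ISLE, another heading\nplain running text\n"

# Decode once, right at the start, then work only with text.
data = raw.decode('utf-8').splitlines()

# Keep the lines whose first two characters pass isupper().
headings = [line for line in data if line[0:2].isupper()]

# The apostrophe is uncased: alone it is not "upper", but it does not
# stop a slice such as "L'" from passing the test.
apostrophe_alone = "'".isupper()
with_letter = "L'".isupper()
```

Here headings ends up holding the HÉRON and L'ISLE lines, while the lowercase line is skipped.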

Now my only glitch is that line.title() screws up placenames like STOKE (BISHOP'S), turning it into Stoke (Bishop'S).
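
One possible workaround, sketched under the assumption of Python 3 (where str patterns in re match Unicode letters by default): treat a run of letters with embedded apostrophes as a single word and capitalize() it as a whole, instead of letting title() restart after the apostrophe. The function name title_placename is hypothetical.

```python
import re

# A "word" is a run of letters, optionally continued by apostrophe + letters.
# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# i.e. a letter, including accented ones such as É.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def title_placename(text):
    """Capitalize each word, keeping letters after an apostrophe lowercase."""
    return WORD.sub(lambda match: match.group(0).capitalize(), text)

fixed = title_placename("STOKE (BISHOP'S)")
```

The catch is that this treats L'ISLE the same way and gives L'isle, so names with a leading article would still need a rule of their own.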

_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor
