Re: [Tutor] more encoding confusion

2007-08-05 Thread Jon Crump
Kent, Many thanks again, and thanks too to Paul at
http://tinyurl.com/yrl8cy.

That's very effective, thanks very much for the detailed explanation;
however, I'm a little surprised that it's necessary. I would have thought
that there would be some standard module that included a unicode
equivalent of the builtin method isupper().

On Fri, 3 Aug 2007, Kent Johnson wrote:

  What sort of re test can I do to catch lines whose defining
  characteristic is that they begin with two or more adjacent utf-8
  encoded capital letters?

 First you have to decode the file to a Unicode string.
 Then build the set of matching characters and build a regex. For
 example, something like this:

 data = open('data.txt').read().decode('utf-8').splitlines()

 uppers = u''.join(unichr(i) for i in xrange(sys.maxunicode)
                   if unichr(i).isupper())

I modified uppers to include only the latin characters, and added the
apostrophe to catch placenames like L'ISLE.

 upperRe = u'^[%s]{2,}' % uppers

 for line in data:
     if re.match(upperRe, line):


 With a tip of the hat to
 http://tinyurl.com/yrl8cy

 Kent

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] more encoding confusion

2007-08-05 Thread Kent Johnson
Jon Crump wrote:
 
 Kent, Many thanks again, and thanks too to Paul at 
 http://tinyurl.com/yrl8cy.
 
 That's very effective, thanks very much for the detailed explanation; 
 however, I'm a little surprised that it's necessary. I would have 
 thought that there would be some standard module that included a unicode 
 equivalent of the builtin method isupper().

Hmm...actually, isupper() works fine on unicode strings:
In [18]: s='H\303\211RON'.decode('utf-8')
In [21]: print 'H\303\211RON'
HÉRON
In [22]: s.isupper()
Out[22]: True

:-)
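
For anyone following along on Python 3, where str is already Unicode and
bytes have to be decoded explicitly, the session above might look roughly
like the sketch below. The variable name raw is only illustrative; the byte
values are the same ones written in octal as \303\211 above.

```python
# UTF-8 bytes for the placename; \xc3\x89 is the same pair as octal \303\211.
raw = b'H\xc3\x89RON'

# Decode once, up front, to get a Unicode string.
s = raw.decode('utf-8')

print(s)                  # the decoded placename
print(len(raw), len(s))   # byte count versus character count
print(s.isupper())        # isupper() on the decoded string
```

The point is the same as in the session above: once the bytes are decoded,
isupper() sees É as a single uppercase letter.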


 I modified uppers to include only the latin characters, and added the 
 apostrophe to catch placenames like L'ISLE.

Then you are back to needing a regular expression I think.
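
A minimal sketch of what such a regular expression could look like, written
for Python 3 rather than the Python 2 used in this thread. Jon's modified
uppers is not shown in the thread, so the helper is_latin_upper and its rule
(uppercase, with a Unicode name beginning with LATIN) are assumptions made
here purely for illustration.

```python
import re
import sys
import unicodedata

def is_latin_upper(ch):
    # Hypothetical rule: uppercase, and named "LATIN ..." in the Unicode database.
    return ch.isupper() and unicodedata.name(ch, '').startswith('LATIN')

# All Latin uppercase letters, collected into one string.
latin_uppers = ''.join(chr(i) for i in range(sys.maxunicode + 1)
                       if is_latin_upper(chr(i)))

# Two or more characters from that set, with the apostrophe added to it.
placename_re = re.compile("[%s']{2,}" % latin_uppers)

for line in ["L'ISLE, Normandie.", "HÉRON (LE), Normandie.", "Hereford again."]:
    print(line, bool(placename_re.match(line)))
```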

Kent

PS Please use Reply All to reply on-list.


Re: [Tutor] more encoding confusion

2007-08-05 Thread Jon Crump

On Sun, 5 Aug 2007, Kent Johnson wrote:

Hmm...actually, isupper() works fine on unicode strings:
In [18]: s='H\303\211RON'.decode('utf-8')
In [21]: print 'H\303\211RON'
HÉRON
In [22]: s.isupper()
Out[22]: True

:-)


I modified uppers to include only the latin characters, and added the 
apostrophe to catch placenames like L'ISLE.


Then you are back to needing a regular expression I think.



Ah! I'm finally starting to get it. My problem wasn't with a regex to test 
the line; it was with reading the file in as utf-8 to begin with. 
When I take your advice and decode() right from the start using:


open('textfile').read().decode('utf-8').splitlines()

instead of

input = open('textfile', 'r')
text = input.readlines()

Then the regex problem does not even arise. Now I can use this instead:

for line in data:
    if line[0:2].isupper():

As you point out, isupper() works just fine on unicode strings; it also 
seems to treat the apostrophe as uppercase, because this catches 
not only HÉRON but L'ISLE as well.
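
What is probably happening here: isupper() does not treat the apostrophe as
uppercase, it simply ignores characters that have no case. The documented
rule is that the result is True when every cased character is uppercase and
there is at least one cased character. A small Python 3 sketch (the sample
strings are made up for illustration):

```python
# Two-character prefixes of the kind line[0:2] would produce.
samples = ["HÉ", "L'", "'", "He", "A ", "42"]

# Map each sample to what isupper() reports for it.
results = {s: s.isupper() for s in samples}

for s, flag in results.items():
    print(repr(s), flag)
```

So a line starting with L' passes because of the L, and a line starting with
a single capital followed by a space (such as "A ") would pass too, which
may or may not matter for this data.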


Now my only glitch is that line.title() screws up placenames like STOKE 
(BISHOP'S), turning it into Stoke (Bishop'S).
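
One possible workaround for the title() glitch, sketched in Python 3.
title() treats the apostrophe as a word boundary, so a regular expression
can be used to decide what counts as a word instead. The function name
titlecase is made up here, and the pattern is an assumption about what a
word should be (a run of letters, optionally followed by an apostrophe and
more letters).

```python
import re

# A "word": letters, optionally followed by an apostrophe and more letters.
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
WORD_RE = re.compile(r"[^\W\d_]+('[^\W\d_]+)?")

def titlecase(s):
    # Capitalize each whole word, so the letters after an apostrophe stay lowercase.
    return WORD_RE.sub(lambda mo: mo.group(0).capitalize(), s)

print("STOKE (BISHOP'S)".title())
print(titlecase("STOKE (BISHOP'S)"))
print(titlecase("HÉRON (LE)"))
```

This gives Stoke (Bishop's), but it also turns L'ISLE into L'isle, so names
with a leading article would still need special handling.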


Re: [Tutor] more encoding confusion

2007-08-03 Thread Kent Johnson
Jon Crump wrote:
 I'm parsing a utf-8 encoded file with lines characterized by placenames 
 in all caps thus:
 
 HEREFORD, Herefordshire.
 ..other lines..
 HÉRON (LE), Normandie.
 ..other lines..
 
 I identify these lines for parsing using
 
 for line in data:
     if re.match(r'[A-Z]{2,}', line):
 
 but of course this catches HEREFORD, but not HÉRON.
 
 What sort of re test can I do to catch lines whose defining 
 characteristic is that they begin with two or more adjacent utf-8 
 encoded capital letters?

First you have to decode the file to a Unicode string.
Then build the set of matching characters and build a regex. For 
example, something like this:

import re
import sys

data = open('data.txt').read().decode('utf-8').splitlines()

uppers = u''.join(unichr(i) for i in xrange(sys.maxunicode)
                  if unichr(i).isupper())
upperRe = u'^[%s]{2,}' % uppers

for line in data:
    if re.match(upperRe, line):


With a tip of the hat to
http://tinyurl.com/yrl8cy

Kent
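
For readers on Python 3, a rough equivalent of the approach above. It is a
sketch under a few assumptions: the sample lines and the temporary file
stand in for data.txt, open() is given encoding='utf-8' so that no separate
decode() step is needed, and chr/range replace unichr/xrange. The names
ascii_hits and unicode_hits are made up for the comparison.

```python
import os
import re
import sys
import tempfile

# Write a small UTF-8 sample file (a stand-in for data.txt).
sample = "HEREFORD, Herefordshire.\n..other lines..\nHÉRON (LE), Normandie.\n"
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', encoding='utf-8') as f:
    f.write(sample)

# Decode while reading, so every line is a Unicode string from the start.
with open(path, encoding='utf-8') as f:
    data = f.read().splitlines()
os.remove(path)

# Character class of everything str.isupper() reports as uppercase.
uppers = ''.join(chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper())
upper_re = re.compile('[%s]{2,}' % uppers)

# The original ASCII-only test next to the Unicode-aware one.
ascii_hits = [line for line in data if re.match(r'[A-Z]{2,}', line)]
unicode_hits = [line for line in data if upper_re.match(line)]
print(ascii_hits)
print(unicode_hits)
```

The ASCII-only pattern stops at É after a single letter, which is why it
never reaches the two-character minimum for HÉRON.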