[Tutor] more encoding confusion

Jon Crump Fri, 03 Aug 2007 11:32:32 -0700

I'm parsing a UTF-8 encoded file with lines characterized by place names in all caps, thus:


HEREFORD, Herefordshire.
..other lines..
HÉRON (LE), Normandie.
..other lines..


I identify these lines for parsing using

for line in data:
    if re.match(r'[A-Z]{2,}', line):

but of course this catches HEREFORD and not HÉRON, since [A-Z] covers only the ASCII capitals.
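
A minimal reproduction of the behaviour described above may help. It assumes Python 3 and lines that have already been decoded from UTF-8 to str; the sample list and the name ascii_caps are illustrative, not taken from the real data.

```python
import re

# Illustrative sample lines, standing in for the decoded file contents.
lines = [
    "HEREFORD, Herefordshire.",
    "..other lines..",
    "HÉRON (LE), Normandie.",
]

# The ASCII-only test from the post: two or more of A-Z at the start of the line.
ascii_caps = re.compile(r'[A-Z]{2,}')

# Keep only the lines the pattern accepts.
matched = [line for line in lines if ascii_caps.match(line)]
print(matched)
```

Only the HEREFORD line ends up in matched: in HÉRON the second character is É, which lies outside the A-Z range, so the {2,} requirement is not met.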

What sort of re test can I do to catch lines whose defining characteristic is that they begin with two or more adjacent UTF-8 encoded capital letters?
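
One possible direction, offered only as a sketch and not as a settled answer: since the standard re module has no character class for "any uppercase letter", the test can be done on the decoded text with str.isupper() instead of a regular expression. This assumes Python 3 and that each line is a str decoded from UTF-8 (for example via open(path, encoding='utf-8')); the function name starts_with_two_caps is hypothetical.

```python
import unicodedata


def starts_with_two_caps(line):
    """Return True if the line begins with two adjacent uppercase letters.

    Hypothetical helper: assumes `line` is a str already decoded from UTF-8.
    """
    # Normalise to NFC so that an accented capital stored as a base letter
    # plus a combining accent is treated as a single character.
    head = unicodedata.normalize('NFC', line)[:2]
    # str.isupper() is Unicode-aware, so it accepts É as well as A-Z.
    return len(head) == 2 and all(ch.isupper() for ch in head)


for sample in ("HEREFORD, Herefordshire.", "..other lines..", "HÉRON (LE), Normandie."):
    print(sample, starts_with_two_caps(sample))
```

This checks only the first two characters, which mirrors the {2,} requirement at the start of the line; whether that is strict enough for the real file is an open question.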

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
