I'm parsing a UTF-8 encoded file in which certain lines begin with a place name in all caps, thus:

HEREFORD, Herefordshire.
..other lines..
HÉRON (LE), Normandie.
..other lines..

I identify these lines for parsing using

import re
for line in data:
    if re.match(r'[A-Z]{2,}', line):
        pass  # handle the place-name line here

but of course this catches HEREFORD and not HÉRON, because [A-Z] covers only the 26 unaccented ASCII capitals.

What sort of re test can I use to catch lines whose defining characteristic is that they begin with two or more adjacent capital letters, including accented ones such as É, in UTF-8 encoded text?
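
For illustration, here is a minimal sketch of one way the test might be done without re at all. It assumes Python 3 and that each line has already been decoded from UTF-8 into a str. The helper name starts_with_capitals and its minimum parameter are hypothetical, made up for this example. It relies on the standard unicodedata module, where category 'Lu' means "Letter, uppercase".

```python
import unicodedata

def starts_with_capitals(line, minimum=2):
    """Return True if the first `minimum` characters are all uppercase letters."""
    prefix = line[:minimum]
    # Category 'Lu' ("Letter, uppercase") covers accented capitals such as É
    # as well as plain A-Z, so no explicit list of accented letters is needed.
    return len(prefix) == minimum and all(
        unicodedata.category(ch) == 'Lu' for ch in prefix
    )

# Sample lines taken from the excerpt above, plus one mixed-case line.
sample = [
    "HEREFORD, Herefordshire.",
    "..other lines..",
    "HÉRON (LE), Normandie.",
    "Hereford again",
]
matches = [line for line in sample if starts_with_capitals(line)]
```

With this sample, matches should hold only the HEREFORD and HÉRON lines. If a regular expression is preferred, the third-party regex module (not the standard re) accepts Unicode property classes, so a pattern along the lines of r'\p{Lu}{2,}' would be the closer analogue of the original test.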
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
