Jon Crump wrote: > I'm parsing a utf-8 encoded file with lines characterized by placenames > in all caps thus: > > HEREFORD, Herefordshire. > ..other lines.. > HÉRON (LE), Normandie. > ..other lines.. > > I identify these lines for parsing using > > for line in data: > if re.match(r'[A-Z]{2,}', line): > > but of course this catches HEREFORD, but not HÉRON. > > What sort of re test can I do to catch lines whose defining > characteristic is that they begin with two or more adjacent utf-8 > encoded capital letters?
First you have to decode the file to a Unicode string. Then build the set of matching characters and build a regex. For example, something like this: data = open('data.txt').read().decode('utf-8').splitlines() uppers = u''.join(unichr(i) for i in xrange(sys.maxunicode) if unichr(i).isupper()) upperRe = u'^[%s]{2,}' % uppers for line in data: if re.match(upperRe, line): With a tip of the hat to http://tinyurl.com/yrl8cy Kent _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor