Re: Regular expressions and Unicode

2008-10-02 Thread Peter Otten
Jeffrey Barish wrote:
> I have a regular expression that I use to extract the surname:
>
> surname = r'(?u).+ (\w+)'
>
> However, when I apply it to this Unicode string, I get only the first 3
> letters of the surname:
>
> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

That's a byte string. You

Re: Regular expressions and Unicode

2008-10-02 Thread skip
Jeffrey> However, when I apply it to this Unicode string, I get only the
Jeffrey> first 3 letters of the surname:

Jeffrey> name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

Maybe

    name = unicode('Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k', "utf-8")

? Yup, that works:

>>> name = unico
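
The suggestion above (decode the UTF-8 bytes to text before matching) could be sketched in Python 3 roughly as follows. The thread itself uses Python 2's unicode(); bytes.decode() is the Python 3 spelling. The names raw_name and SURNAME_RE are chosen here for illustration and do not appear in the thread.

```python
import re

# The literal from the thread, written as an explicit byte string (Python 3).
raw_name = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

# Decode the UTF-8 bytes to text. Python 2 spelled this unicode(raw_name, "utf-8").
name = raw_name.decode('utf-8')

# str patterns are Unicode-aware by default in Python 3, so the (?u) flag
# from the original pattern is not needed here.
SURNAME_RE = re.compile(r'.+ (\w+)')

match = SURNAME_RE.search(name)
surname = match.group(1) if match else None
print(surname)
```

Once the input is text rather than bytes, \w sees whole characters such as "ř" and "á" instead of their individual UTF-8 bytes, so the capture group can run to the end of the surname.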

Regular expressions and Unicode

2008-10-02 Thread Jeffrey Barish
I have a regular expression that I use to extract the surname:

surname = r'(?u).+ (\w+)'

However, when I apply it to this Unicode string, I get only the first 3 letters of the surname:

name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
surname_re = re.compile(surname)
m = surname_re.search(name)
m.gr
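
A Python 3 analogue of the symptom described above may help show where the match stops. This is only a sketch: the original post is Python 2, where the details of (?u) on a byte string may differ, and Python 3 rejects (?u) on a bytes pattern, so the flag is omitted. The names raw_name, BYTES_SURNAME_RE, truncated and stop_byte are illustrative.

```python
import re

# The same literal as in the post, as an explicit byte string.
raw_name = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

# With a bytes pattern, \w matches only ASCII letters, digits and underscore.
BYTES_SURNAME_RE = re.compile(br'.+ (\w+)')

m = BYTES_SURNAME_RE.search(raw_name)
truncated = m.group(1)

# The byte immediately after the captured group: the first byte of the
# two-byte UTF-8 sequence that encodes the fourth letter of the surname.
stop_byte = raw_name[m.end(1):m.end(1) + 1]

print(truncated, stop_byte)
```

The capture ends at the first non-ASCII byte, which is consistent with seeing only the first three letters of the surname when the input is bytes rather than decoded text.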