Dinesh B Vadhia <dineshbvad...@hotmail.com> wrote:
> Hi! I want to process text that contains citations, in this case in legal
> documents, and pull-out each individual citation.

Here is my stab at it, using regular expressions. Any comments are welcome.

I had to use two regexes: one to find all citations, and the other to split each citation into its components. They are basically the same, the former without grouping and the latter with named groups.

***
import re

text = "¤some common-law legal comments¤"

# Splits one citation into its components (named groups).
split_up_cit = re.compile(
    r'(?P<name>[A-Z]\w+(?:\s[A-Za-z]\w+)*?)'         # name
    r'\sv\.\s'                                        # versus
    r'(?P<other_name>[A-Z]\w+(?:\s[A-Za-z]\w+)*?),'  # other name
    r'(?P<refs>[^\(]+)'                               # references
    r'(?P<year>\(.*?\d\d\d\d\))[,;.]')                # year

# Same pattern without groups, so findall() returns whole citations.
whole_cit = re.compile(
    r'[A-Z]\w+(?:\s[A-Za-z]\w+)*?'    # name
    r'\sv\.\s'                         # versus
    r'[A-Z]\w+(?:\s[A-Za-z]\w+)*?,'   # other name
    r'[^\(]+'                          # references
    r'\(.*?\d\d\d\d\)[,;.]')           # year

for cit in whole_cit.findall(text):
    parts = split_up_cit.search(cit)  # search once per citation
    # One output line per reference in the comma-separated list.
    for ref in parts.group('refs').split(','):
        print(parts.group('name'), 'v.', parts.group('other_name'),
              ref.strip(), parts.group('year'))
***

The results look like what is expected, with the exception of "In John Doggone Williams" rather than just "John Doggone Williams". As Kent remarked, it is difficult to leave out of the names the parts that should be left out. "Page 500" is easier to deal with, however.

I make it mandatory that the first word of the name starts with an uppercase letter ([A-Z]), and that all other words of the name start with a letter ([A-Za-z]). Yes, I include lowercase letters so that names like 'Pierre Choderlos de Laclos' or 'Guido van Rossum' are dealt with correctly. Note that with the [A-Za-z] range, accented letters may not be dealt with correctly.

Emmanuel

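P.S. A possible variation, offered only as a sketch: since the two regexes differ only in their groups, the named-group one can probably do both jobs if it is driven with finditer(), which returns match objects instead of strings. The sketch also tries to address the accented-letter caveat by using [^\W\d_] (any Unicode letter) for the start of the later words and an extended uppercase range for the first one. It assumes Python 3, where \w is Unicode-aware by default. The names UPPER, LETTER, CITATION and iter_citations, the Latin-1 uppercase range, and the sample text are all made up for illustration and are not from the original thread.

```python
import re

# Assumption: ASCII uppercase plus the Latin-1 uppercase letters is "enough".
UPPER = r'[A-Z\u00c0-\u00d6\u00d8-\u00de]'
# Any Unicode letter: a word character that is neither a digit nor "_".
LETTER = r'[^\W\d_]'

# Single pattern with named groups, same structure as the two regexes above.
CITATION = re.compile(
    rf'(?P<name>{UPPER}\w+(?:\s{LETTER}\w+)*?)'         # name
    r'\sv\.\s'                                           # versus
    rf'(?P<other_name>{UPPER}\w+(?:\s{LETTER}\w+)*?),'  # other name
    r'(?P<refs>[^(]+)'                                   # references
    r'(?P<year>\(.*?\d\d\d\d\))[,;.]'                    # year
)


def iter_citations(text):
    """Yield one (name, other_name, ref, year) tuple per reference."""
    for match in CITATION.finditer(text):
        for ref in match.group('refs').split(','):
            yield (match.group('name'), match.group('other_name'),
                   ref.strip(), match.group('year'))


# Hypothetical sample text, invented for this sketch ("\u00c9" is E-acute).
SAMPLE = (
    "the court relied on Smith v. Jones, 123 F.2d 456, 789 U.S. 12 (1990). "
    "compare \u00c9mile Zola v. Guido van Rossum, 1 A.2d 2 (Cal. 2001);"
)

if __name__ == "__main__":
    for name, other_name, ref, year in iter_citations(SAMPLE):
        print(name, 'v.', other_name, ref, year)
```

This does nothing about the "In John Doggone Williams" problem: any capitalised word directly in front of a name is still swallowed, which is why the sample sentences keep the preceding words in lowercase.
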
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor