Kent,

I've just thought that, as an initial attempt, the last name (of the name before the "v.") is sufficient, i.e. "Turner v. Fouche, 396 U.S. 346 (1970)" instead of "Lathe Turner v. Fouche, 396 U.S. 346 (1970)", as we are only using the citations internally and not displaying them publicly. That solves the first name problem.

The remaining problem is picking up multiple pages in a citation, e.g.

"John Doggone Williams v. Florida, 399 U.S. 78, 90 S.Ct. 1893, 234, 26 L.Ed.2d 446 (1970)"

... and a variation of this is:

"John Doe Agency v. John Doe Corp., 493 U.S. 146, 159-60 (1934)"

I didn't know about pyparsing, which appears to be very powerful, and I have joined their list. Thank you for your help.

Dinesh

From: Kent Johnson
Sent: Saturday, February 07, 2009 1:19 PM
To: Dinesh B Vadhia
Cc: tutor@python.org
Subject: Re: [Tutor] Picking up citations

It turns out you can use Or expressions to cause a kind of backtracking in Pyparsing. This is very close to what you want:

    # 'text' is the sample input and 'pp' a pretty-printer (presumably
    # pprint.pprint); neither is defined in this message.
    from pyparsing import (Forward, Combine, Word, OneOrMore, Suppress,
                           CharsNotIn, ParseResults, alphas, alphanums, nums)

    Name1 = Forward()
    Name1 << Combine(Word(alphas) + Name1 | Word(alphas) + Suppress('v.'),
                     joinString=' ', adjacent=False).setResultsName('name1')

    Name2 = Combine(OneOrMore(Word(alphas)),
                    joinString=' ', adjacent=False).setResultsName('name2')

    Volume = Word(nums).setResultsName('volume')
    Reporter = Word(alphas, alphanums+".").setResultsName('reporter')
    Page = Word(nums).setResultsName('page')
    Page2 = (',' + Word(nums)).setResultsName('page2')

    VolumeCitation = (Volume + Reporter + Page).setResultsName('volume_citation', listAllMatches=True)

    VolumeCitations = Forward()
    VolumeCitations << (
        Combine(VolumeCitation + Page2, joinString=' ', adjacent=False).setResultsName('volume_citation2') + Suppress(',') + VolumeCitations
        | VolumeCitation + Suppress(',') + VolumeCitations
        | Combine(VolumeCitation + Page2, joinString=' ', adjacent=False).setResultsName('volume_citation2')
        | VolumeCitation
        )

    Date = (Suppress('(') + Combine(CharsNotIn(')')).setResultsName('date') + Suppress(')'))

    FullCitation = Name1 + Name2 + Suppress(',') + VolumeCitations + Date

    for item in FullCitation.scanString(text):
        fc = item[0]
        # Uncomment the following to see the raw parse results
        # pp(fc)
        # print
        # print fc.name1
        # print fc.name2
        # for vc in fc.volume_citation:
        #     pp(vc)

        # If name1 is multiple words it is enclosed in a ParseResults
        name1 = fc.name1
        if isinstance(name1, ParseResults):
            name1 = name1[0]

        for vc in fc.volume_citation:
            print '%s v. %s, %s %s %s (%s)' % (name1, fc.name2, vc.volume, vc.reporter, vc.page, fc.date)
        for vc2 in fc.volume_citation2:
            print '%s v. %s, %s (%s)' % (name1, fc.name2, vc2, fc.date)
        print

Output:

    Carter v. Jury Commission of Greene County, 396 U.S. 320 (1970)
    Carter v. Jury Commission of Greene County, 90 S.Ct. 518 (1970)
    Carter v. Jury Commission of Greene County, 24 L.Ed.2d 549 (1970)

    Lathe Turner v. Fouche, 396 U.S. 346 (1970)
    Lathe Turner v. Fouche, 90 S.Ct. 532 (1970)
    Lathe Turner v. Fouche, 24 L.Ed.2d 567 (1970)

    White v. Crook, 251 F.Supp. 401 (DCMD Ala.1966)

    In John Doggone Williams v. Florida, 399 U.S. 78 (1970)
    In John Doggone Williams v. Florida, 26 L.Ed.2d 446 (1970)
    In John Doggone Williams v. Florida, 90 S.Ct. 1893 , 234 (1970)

It is correct except for the inclusion of "In" in the name and the extra space before the comma separating the page numbers in the last citation.

Don't ask me why I did this :-)

Kent
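
For comparison, the two points still open in this thread (keeping only the last word of the first party's name, and picking up extra page numbers such as ", 234" or ", 159-60") can be sketched with the standard `re` module rather than pyparsing. This is only a rough illustration written for Python 3, not the approach used in the message above: the names `parse_citation`, `VOLUME_CITE` and `FULL_CITE` and the keys of the returned dictionary are invented here, and it assumes one citation per string and reporter abbreviations without embedded spaces.

```python
import re

# One "volume reporter page" group, optionally followed by extra pages
# such as ", 234" or ", 159-60". The negative lookahead stops an extra page
# from swallowing the volume number of the next citation (", 90 S.Ct. ...").
VOLUME_CITE = re.compile(
    r"(?P<volume>\d+)\s+"
    r"(?P<reporter>[A-Za-z][A-Za-z0-9.]*)\s+"
    r"(?P<page>\d+)"
    r"(?P<pins>(?:,\s*\d+(?:-\d+)?\b(?!\s+[A-Za-z]))*)"
)

# "<name1> v. <name2>, <volume citations> (<date>)"
FULL_CITE = re.compile(
    r"(?P<name1>\S.*?)\s+v\.\s+"
    r"(?P<name2>.+?),\s+"
    r"(?P<cites>\d+\s.+?)\s*"
    r"\((?P<date>[^)]*)\)"
)


def parse_citation(text):
    """Return a dict describing the first citation in text, or None."""
    m = FULL_CITE.search(text)
    if m is None:
        return None
    cites = []
    for vc in VOLUME_CITE.finditer(m.group("cites")):
        # Split the optional ", 234" / ", 159-60" tail into separate pages.
        pins = re.findall(r"\d+(?:-\d+)?", vc.group("pins"))
        cites.append(
            (vc.group("volume"), vc.group("reporter"), vc.group("page"), pins)
        )
    return {
        # Only the last word before "v." is kept, which also drops a
        # leading "In" picked up from the surrounding sentence.
        "last_name": m.group("name1").split()[-1],
        "name2": m.group("name2"),
        "cites": cites,
        "date": m.group("date"),
    }


if __name__ == "__main__":
    samples = [
        "John Doggone Williams v. Florida, 399 U.S. 78, 90 S.Ct. 1893, 234, "
        "26 L.Ed.2d 446 (1970)",
        "John Doe Agency v. John Doe Corp., 493 U.S. 146, 159-60 (1934)",
    ]
    for sample in samples:
        print(parse_citation(sample))
```

Taking the last word is a blunt rule: it gives "Turner" for "Lathe Turner" but "Agency" for "John Doe Agency", so organisation names would need separate handling.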