On Mon, Feb 9, 2009 at 12:51 PM, Dinesh B Vadhia <dineshbvad...@hotmail.com> wrote:
> Kent /Emmanuel
>
> Below are the results using the PLY parser and Regex versions on the
> attached 'sierra' data which I think covers the common formats. Here are
> some 'fully unparsed' citations that were missed by the programs:
>
> Smith v. Wisconsin Dept. of Agriculture, 23 F.3d 1134, 1141 (7th Cir.1994)
>
> Indemnified Capital Investments, S.A. v. R.J. O'Brien & Assoc., Inc., 12
> F.3d 1406, 1409 (7th Cir.1993).
>
> Hunt v. Washington Apple Advertising Commn., 432 U.S. 333, 343, 97 S.Ct.
> 2434, 2441, 53 L.Ed.2d 383 (1977)
>
> Idaho Conservation League v. Mumma, 956 F.2d 1508, 1517-18 (9th Cir.1992)

A few issues here:
S.A. - this one is hard: allowing it while still filtering out ordinary sentences
R.J. O'Brien, etc. - loosening up the rules for the second name can allow these
1517-18 - allow page ranges

The name issues are getting to be too much for me. Attached is a PLY version that just pulls out the citation without the name; at one point you indicated that would work for you.

Kent

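For the page-range item, here is a minimal regex-only sketch of the idea, separate from the attached PLY grammar. The names PAGE and parse_page are made up for illustration, and expanding an abbreviated end page ("1517-18" meaning 1517 to 1518) is an assumption about the citation convention, not something the attached script does.

```python
import re

# Hypothetical helper, not part of the attached script.
# A page is either a single number or "first-last".
PAGE = re.compile(r'(\d+)(?:-(\d+))?')

def parse_page(s):
    """Return (first, last) page numbers, or None if s is not a page."""
    m = PAGE.fullmatch(s)
    if m is None:
        return None
    first, last = m.group(1), m.group(2)
    if last is None:
        # A single page is treated as a one-page range.
        return int(first), int(first)
    if len(last) < len(first):
        # Assumption: a shorter end page reuses the leading digits
        # of the first page, so "1517-18" ends at 1518.
        last = first[:len(first) - len(last)] + last
    return int(first), int(last)

print(parse_page("1517-18"))
print(parse_page("1134"))
```

In the grammar itself the same idea shows up as a second alternative on the page rule, with '-' added to the literals.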
# Parser for legal citations, PLY version
# This version doesn't parse the names

from ply import lex, yacc

debug = 0

text = """Indemnified Capital Investments, S.A. v. R.J. O'Brien & Assoc., Inc., 12 F.3d 1406, 1409 (7th Cir.1993).

Hunt v. Washington Apple Advertising Commn., 432 U.S. 333, 343, 97 S.Ct. 2434, 2441, 53 L.Ed.2d 383 (1977)

Smith v. Wisconsin Dept. of Agriculture, 23 F.3d 1134, 1141 (7th Cir.1994)

Idaho Conservation League v. Mumma, 956 F.2d 1508, 1517-18 (9th Cir.1992)

NFMA, NEPA, or MUSYA. Sierra Club v. Marita, 843 F.Supp. 1526 (E.D.Wis.1994) ("Nicolet ").

Page 500 Carter v. Jury Commission of Greene County, 396 U.S. 320, 90 S.Ct. 518, 24 L.Ed.2d 549 (1970); Lathe Turner v. Fouche, 396 U.S. 346, 90 S.Ct. 532, 24 L.Ed.2d 567 (1970); White v. Crook, 251 F.Supp. 401 (DCMD Ala.1966). Moreover, the Court has also recognized that the exclusion of a discernible class from jury service injures not only those defendants who belong to the excluded class, but other defendants as well, in that it destroys the possibility that the jury will reflect a representative cross section of the community. In John Doggone Williams v. Florida, 399 U.S. 78, 90 S.Ct. 1893, 234, 26 L.Ed.2d 446 (1970), we sought to delineate some of the essential features of the jury that is guaranteed, in certain circumstances, by the Sixth Amendment. We concluded that it comprehends, inter alia, 'a fair possibility for obtaining a representative cross-section of the community.'
399 U.S., at 100, 90 S.Ct., at 1906.9 Thus if the Sixth Amendment were applicable here, and petitioner were challenging a post-Duncan petit jury, he would clearly have standing to challenge the systematic exclusion of any identifiable group from jury service."""

# Lexical tokens
tokens = (
    'NUMBER',
    'MIXED',
    'YEAR',
)

literals = ",()-"

# Regular expression rules for simple tokens
t_NUMBER = r'\d+'
t_MIXED = r'[A-Za-z][A-Za-z.0-9\']+'  # References and names after the first work
t_YEAR = r'\([^)]+\)'  # Note: "year" can contain multiple words and non-numeric

# A string containing ignored characters (spaces and tabs)
t_ignore = ' \t\r\n'

# Error handling rule
def t_error(t):
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

def test_lexer(data):
    lexer.input(data)

    # Tokenize
    while True:
        tok = lexer.token()
        if not tok: break      # No more input
        print tok

# Parser productions

def p_Page(p):
    '''page : NUMBER
            | NUMBER '-' NUMBER'''
    if len(p) == 2:
        p[0] = p[1]
    else:
        p[0] = p[1] + p[2] + p[3]

def p_Reference(p):
    '''reference : NUMBER MIXED page'''
    p[0] = '%s %s %s' % (p[1], p[2], p[3])

def p_Reference_List(p):
    '''reference_list : reference
                      | reference_list ',' page
                      | reference_list ',' reference
                      | reference_list ',' page ',' reference'''
    if len(p) == 2:
        p[0] = [p[1]]                    # single reference
    elif len(p) == 4:
        if p.slice[3].type == 'reference':
            p[0] = p[1] + [p[3]]         # append new reference
        else:
            p[1][-1] += ', %s' % p[3]    # append page number
            p[0] = p[1]
    else:
        # page number and reference
        p[1][-1] += ', %s' % p[3]        # append page number
        p[0] = p[1] + [p[5]]             # append new reference

def p_Citation(p):
    '''citation : reference_list YEAR error'''
    for reference in p[1]:
        print '%s %s' % (reference, p[2])
    print

def p_Citations(p):
    '''citations : citation
                 | citations citation'''
    pass

def p_error(p):
    pass

start = 'citations'

# Build the parser
parser = yacc.yacc()

if __name__ == '__main__':
    parser.parse(text, debug=debug)

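To make the bookkeeping in p_Reference_List and p_Citation easier to follow, here is a rough plain-Python sketch of the same folding step without PLY. It assumes the parser has already produced a flat sequence of ('reference', text) and ('page', text) items; that item format and the names collect_references and format_citation are invented for this illustration and do not appear in the script.

```python
# Hypothetical stand-in for the grammar actions; runs without PLY.
def collect_references(items):
    """Fold ('reference', text) and ('page', text) items into strings.

    A page item is attached to the most recent reference, mirroring
    the "append page number" branches of p_Reference_List.
    """
    references = []
    for kind, value in items:
        if kind == 'reference':
            references.append(value)             # start a new reference
        elif kind == 'page' and references:
            references[-1] += ', %s' % value     # extra page for the last one
    return references

def format_citation(items, year):
    """Pair every reference with the shared year, as p_Citation prints them."""
    return ['%s %s' % (ref, year) for ref in collect_references(items)]

# The Hunt citation from the test text, already split into items.
hunt = [
    ('reference', '432 U.S. 333'), ('page', '343'),
    ('reference', '97 S.Ct. 2434'), ('page', '2441'),
    ('reference', '53 L.Ed.2d 383'),
]
for line in format_citation(hunt, '(1977)'):
    print(line)
```

Each parallel reference ends up on its own line with the year repeated, which is the output format the script aims for.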
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor