On Sun, Feb 8, 2009 at 7:07 PM, Dinesh B Vadhia <dineshbvad...@hotmail.com> wrote:
> Hi Kent
>
> From pyparsing to PLY in a few days ... this is too much to handle! I tried
> the program and like you said it works except for the inclusion of the full
> name. I tested it on different text and it doesn't work as expected (see
> attached file).

I'm finding the PLY version a little easier to work with. It exposes more of
the internals and feels like I have more control. The errors in parsing
sierra.txt are mostly because of over-aggressive inclusion of words in the
first name.

Attached is a better one. It also errors on unexpected punctuation, which
prevents "MUSYA Sierra Club" from being read as the first party's name.

Kent

# Parser for legal citations, PLY version

from ply import lex, yacc

text = """NFMA, NEPA, or MUSYA. Sierra Club v. Marita, 843 F.Supp. 1526
(E.D.Wis.1994) ("Nicolet "). Page 500 Carter v. Jury Commission of Greene
County, 396 U.S. 320, 90 S.Ct. 518, 24 L.Ed.2d 549 (1970); Lathe Turner v.
Fouche, 396 U.S. 346, 90 S.Ct. 532, 24 L.Ed.2d 567 (1970); White v. Crook,
251 F.Supp. 401 (DCMD Ala.1966). Moreover, the Court has also recognized that
the exclusion of a discernible class from jury service injures not only those
defendants who belong to the excluded class, but other defendants as well, in
that it destroys the possibility that the jury will reflect a representative
cross section of the community. In John Doggone Williams v. Florida, 399 U.S.
78, 90 S.Ct. 1893, 234, 26 L.Ed.2d 446 (1970), we sought to delineate some of
the essential features of the jury that is guaranteed, in certain
circumstances, by the Sixth Amendment. We concluded that it comprehends, inter
alia, 'a fair possibility for obtaining a representative cross-section of the
community.' 399 U.S., at 100, 90 S.Ct., at 1906.9 Thus if the Sixth Amendment
were applicable here, and petitioner were challenging a post-Duncan petit
jury, he would clearly have standing to challenge the systematic exclusion of
any identifiable group from jury service."""

# Lexical tokens
tokens = (
    'NAME',
    'NUMBER',
    'V',
    'MIXED',
    'PUNCT',  # misc punctuation not otherwise part of the grammar
    'YEAR',
)

literals = ",()"

# Defining these as functions gives them priority over the simple tokens
def t_V(t):
    r'v\.'
    return t

# The first word of a name, must be all alpha, start with capital letter
def t_NAME(t):
    r'[A-Z][A-Za-z]+'
    return t

# Regular expression rules for simple tokens
t_NUMBER = r'\d+'
t_MIXED = r'[A-Za-z][A-Za-z.0-9\']+'  # References and names after the first word
t_PUNCT = r'[\.;:!?]'  # These will generate error tokens
t_YEAR = r'\([^)]+\)'  # Note: "year" can contain multiple words and non-numeric

# A string containing ignored characters (spaces, tabs and line endings)
t_ignore = ' \t\r\n'

# Error handling rule
def t_error(t):
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

def test_lexer(data):
    lexer.input(data)

    # Tokenize
    while True:
        tok = lexer.token()
        if not tok:
            break  # No more input
        print(tok)

# Parser productions

# Only restrict initial name to leading caps, no numbers
def p_Name1(p):
    '''name1 : NAME
             | name1 NAME'''
    p[0] = p[1]
    if len(p) == 3:
        p[0] += ' ' + p[2]

def p_Name2(p):
    '''name2 : NAME
             | name2 NAME
             | name2 MIXED'''
    p[0] = p[1]
    if len(p) == 3:
        p[0] += ' ' + p[2]

def p_Parties(p):
    'parties : name1 V name2'
    p[0] = '%s v. %s' % (p[1], p[3])

def p_Reference(p):
    '''reference : NUMBER MIXED NUMBER'''
    p[0] = '%s %s %s' % (p[1], p[2], p[3])

def p_Reference_List(p):
    '''reference_list : reference
                      | reference_list ',' NUMBER
                      | reference_list ',' reference
                      | reference_list ',' NUMBER ',' reference'''
    if len(p) == 2:
        p[0] = [p[1]]  # single reference
    elif len(p) == 4:
        if p.slice[3].type == 'reference':
            p[0] = p[1] + [p[3]]  # append new reference
        else:
            p[1][-1] += ', %s' % p[3]  # append page number
            p[0] = p[1]
    else:
        # page number and reference
        p[1][-1] += ', %s' % p[3]  # append page number
        p[0] = p[1] + [p[5]]  # append new reference (p[5]; p[3] is the page number)

def p_Citation(p):
    '''citation : parties ',' reference_list YEAR error'''
    for reference in p[3]:
        print('%s, %s %s' % (p[1], reference, p[4]))
    print('')

def p_Citations(p):
    '''citations : citation
                 | citations citation'''
    pass

def p_error(p):
    pass

start = 'citations'

# Build the parser
parser = yacc.yacc()

if __name__ == '__main__':
    parser.parse(text)

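
For anyone who wants to see how the lexer rules above split up the "MUSYA. Sierra Club" case without installing PLY, here is a rough approximation using only the standard re module. It is a sketch, not PLY's actual implementation: TOKEN_SPEC, MASTER and tokenize are names made up for this illustration, and the rule order assumes PLY's documented behaviour (function rules in definition order, then string rules by decreasing regex length, with literals checked last).

```python
import re

# Same patterns as the lexer above, in the order PLY is assumed to try them.
# LITERAL and SKIP stand in for PLY's `literals` and `t_ignore` settings.
TOKEN_SPEC = [
    ('V', r'v\.'),
    ('NAME', r'[A-Z][A-Za-z]+'),
    ('MIXED', r"[A-Za-z][A-Za-z.0-9']+"),
    ('YEAR', r'\([^)]+\)'),
    ('PUNCT', r'[.;:!?]'),
    ('NUMBER', r'\d+'),
    ('LITERAL', r'[,()]'),
    ('SKIP', r'[ \t\r\n]+'),
]

# One alternation of named groups; the first alternative that matches wins.
MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))


def tokenize(data):
    """Return a list of (type, value) pairs for the input string."""
    tokens = []
    pos = 0
    while pos < len(data):
        match = MASTER.match(data, pos)
        if match is None:
            pos += 1  # mimic t_error: skip one unrecognised character
            continue
        pos = match.end()
        kind = match.lastgroup
        if kind == 'SKIP':
            continue  # ignored whitespace
        if kind == 'LITERAL':
            kind = match.group()  # literals use the character as their type
        tokens.append((kind, match.group()))
    return tokens


if __name__ == '__main__':
    sample = 'MUSYA. Sierra Club v. Marita, 843 F.Supp. 1526 (E.D.Wis.1994)'
    for tok in tokenize(sample):
        print(tok)
```

In the sample, "MUSYA" comes out as a NAME followed by a separate PUNCT token for the period. Since PUNCT appears in no grammar production, the parser reports an error at that point, so "MUSYA" does not get glued onto "Sierra Club".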
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor