I guess I'm in the mood for a parsing challenge this weekend, so I wrote
a PLY version of the citation parser; see attached. It generates
exactly the output you asked for, except that it includes "In" in
the name.

Kent
# Parser for legal citations, PLY version

from ply import lex, yacc

text = """
Page 500 Carter v. Jury Commission of Greene County, 396 U.S. 320, 90 S.Ct. 518, 24 L.Ed.2d 549 (1970); 
Lathe Turner v. Fouche, 396 U.S. 346, 90 S.Ct. 532, 24 L.Ed.2d 567 (1970); 
White v. Crook, 251 F.Supp. 401 (DCMD Ala.1966). 

Moreover, the Court has also recognized that the exclusion of a discernible class from jury service 
injures not only those defendants who belong to the excluded class, 
but other defendants as well, in that it destroys the possibility 
that the jury will reflect a representative cross section of the community. 

In John Doggone Williams v. Florida, 399 U.S. 78, 90 S.Ct. 1893, 234, 26 L.Ed.2d 446 (1970), 

we sought to delineate some of the essential features of the jury that is guaranteed, 
in certain circumstances, by the Sixth Amendment. We concluded that it comprehends, 
inter alia, 'a fair possibility for obtaining a representative cross-section of the community.' 
399 U.S., at 100, 90 S.Ct., at 1906.9 Thus if the Sixth Amendment were applicable here, 
and petitioner were challenging a post-Duncan petit jury, 
he would clearly have standing to challenge the systematic exclusion of any identifiable group from jury service."""

# Lexical tokens

tokens = (
   'NAME',
   'NUMBER',
   'V',
   'MIXED',
   'YEAR',
)

literals = ",()"

# Defining these as functions gives them priority over the simple tokens
def t_V(t):
    r'v\.'
    return t

# The first word of a name, must be all alpha, start with capital letter
def t_NAME(t):
    r'[A-Z][A-Za-z]+'
    return t

# Regular expression rules for simple tokens
t_NUMBER = r'\d+'
t_MIXED = r'[A-Za-z][A-Za-z.0-9]+'  # References and names after the first word
t_YEAR = r'\([^)]+\)'   # Note: "year" can contain multiple words and non-numeric

# A string containing ignored characters (spaces, tabs, and line endings)
t_ignore  = ' \t\r\n'

# Error handling rule
def t_error(t):
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

def test_lexer(data):
    lexer.input(data)

    # Tokenize
    while True:
        tok = lexer.token()
        if not tok: break      # No more input
        print(tok)

# Parser productions

def p_Name(p):
    '''name : NAME
            | name NAME
            | name MIXED'''
    p[0] = p[1]
    if len(p) == 3:
        p[0] += ' ' + p[2]

def p_Parties(p):
    'parties : name V name'
    p[0] = '%s v. %s' % (p[1], p[3])

def p_Reference(p):
    '''reference : NUMBER MIXED NUMBER'''
    p[0] = '%s %s %s' % (p[1], p[2], p[3])

def p_Reference_List(p):
    '''reference_list : reference
                      | reference_list ',' NUMBER
                      | reference_list ',' reference
                      | reference_list ',' NUMBER ',' reference'''

    if len(p) == 2:
        p[0] = [p[1]]   # single reference
    elif len(p) == 4:
        if p.slice[3].type == 'reference':
            p[0] = p[1] + [p[3]]   # append new reference
        else:
            p[1][-1] += ', %s' % p[3]   # append page number
            p[0] = p[1]
    else:
        # page number and reference: p[3] is the NUMBER, p[5] is the reference
        p[1][-1] += ', %s' % p[3]   # append page number
        p[0] = p[1] + [p[5]]   # append new reference


def p_Citation(p):
    '''citation : parties ',' reference_list YEAR error'''
    # Print one line per reference, then a blank line between citations
    for reference in p[3]:
        print('%s, %s %s' % (p[1], reference, p[4]))
    print('')

def p_Citations(p):
    '''citations : citation
                 | citations citation'''
    pass
    
def p_error(p):
    pass
    
start = 'citations'


# Build the parser
parser = yacc.yacc()


if __name__ == '__main__':
    parser.parse(text)
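
The comment about function rules taking priority is the part most likely to surprise, so here is a rough standard-library sketch of the matching order. It is not PLY's implementation: it assumes, as the PLY documentation describes, that function rules are tried in definition order and string rules afterwards, longest pattern first, and it emulates that with one `re` alternation. `TOKEN_SPEC`, `MASTER`, `tokenize`, and the `LITERAL` type are names made up for this illustration (PLY reports a literal under the character itself).

```python
import re

# Hypothetical stand-in for the PLY lexer above, in the order the rules are
# assumed to be tried: function rules first (V, NAME), then the string rules
# sorted by decreasing pattern length (MIXED, YEAR, NUMBER), then literals.
TOKEN_SPEC = [
    ("V", r"v\."),
    ("NAME", r"[A-Z][A-Za-z]+"),
    ("MIXED", r"[A-Za-z][A-Za-z.0-9]+"),
    ("YEAR", r"\([^)]+\)"),
    ("NUMBER", r"\d+"),
    ("LITERAL", r"[,()]"),
]

# Python's alternation takes the first alternative that matches, not the
# longest one, so the order of TOKEN_SPEC is what decides between rules.
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(data):
    """Return (type, value) pairs, skipping anything no rule matches."""
    tokens = []
    pos = 0
    while pos < len(data):
        match = MASTER.match(data, pos)
        if match is None:
            pos += 1  # whitespace or stray character: skip one and retry
            continue
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for tok in tokenize("White v. Crook, 251 F.Supp. 401 (DCMD Ala.1966)."):
        print(tok)
```

With this ordering "White" comes out as NAME, while "F.Supp." falls through to MIXED because NAME needs a second letter right after the capital. The whole parenthesized group is one YEAR token, since that rule is tried before the "(" literal.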

    