Re: make sure entire string was parsed

2005-09-13 Thread Steven Bethard
Paul McGuire wrote:
> I still don't know the BNF you are working from

Just to satisfy any curiosity you might have, it's the Penn TreeBank 
format: http://www.cis.upenn.edu/~treebank/
(Except that the actual Penn Treebank data unfortunately differs from 
the format spec in a few ways.)

> 1. I'm surprised func_word does not permit numbers anywhere in the
> body.  Is this just a feature you have not implemented yet?  As long as
> func_word does not start with a digit, you can still define one
> unambiguously to allow numbers after the first character if you define
> func_word as
> 
> func_word = _pp.Word(func_chars,func_chars+_pp.nums)

Ahh, very nice.  The spec's vague, but this is probably what I want to do.
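
As a minimal sketch of the two-argument Word form quoted above: the first argument gives the characters allowed in the first position, the second gives the characters allowed in the rest of the word. Plain ASCII letters stand in for func_chars here, which is an assumption; the posted draft builds its set from _pp.printables.

```python
import pyparsing as pp

# Assumption: letters stand in for the real func_chars set.
func_chars = pp.alphas

# First character from func_chars, later characters may also be digits.
func_word = pp.Word(func_chars, func_chars + pp.nums)

print(func_word.parseString("SBJ2").asList())   # ['SBJ2']

try:
    func_word.parseString("2SBJ")               # a leading digit is not allowed
except pp.ParseException as err:
    print("rejected:", err)
```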

> 2. Is coord an optional sub-element of a func?

No, functions, coord and id are optional sub-elements of the tags string.

> You might also add a default value for coord_tag if none is supplied,
> to simplify your parse action?

Oh, that's nice.  I missed that functionality.
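
A small sketch of how the grouping and the default value might behave together. The element names follow the quoted suggestion, but the character sets are simplified stand-ins (an assumption), so this is an illustration rather than the grammar from the draft.

```python
import pyparsing as pp

# Assumption: simplified character sets, not the ones from the posted draft.
func_word = pp.Word(pp.alphas)
num_word = pp.Word(pp.nums)
tag_sep = pp.Literal("-").suppress()
coord_sep = pp.Literal("=").suppress()

# The second argument to Optional is the token returned when nothing matches.
coord_tag = pp.Optional(pp.Combine(coord_sep + num_word), None)

# Group keeps each function together with its (possibly defaulted) coord.
func_tags = pp.ZeroOrMore(pp.Group(tag_sep + func_word + coord_tag))

print(func_tags.parseString("-x=2-y").asList())   # [['x', '2'], ['y', None]]
```

With the default in place, every group has exactly two elements, so a parse action can unpack each one without checking its length first.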

> It's not clear to me what if any further help you are looking for, now
> that your initial question (about StringEnd()) has been answered.

Yes, thanks, you definitely answered the initial question.  And your 
followup commentary was also very helpful.  Thanks again!

STeVe
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: make sure entire string was parsed

2005-09-12 Thread Paul McGuire
Steve -

Wow, this is a pretty dense pyparsing program.  You are really pushing
the envelope in your use of ParseResults, dicts, etc., but pretty much
everything seems to be working.

I still don't know the BNF you are working from, but here are some
other "shots in the dark":

1. I'm surprised func_word does not permit numbers anywhere in the
body.  Is this just a feature you have not implemented yet?  As long as
func_word does not start with a digit, you can still define one
unambiguously to allow numbers after the first character if you define
func_word as

func_word = _pp.Word(func_chars,func_chars+_pp.nums)

Perhaps similar for syn_word as well.

2. Is coord an optional sub-element of a func?  If so, you might want
to group them so that they stay together, something like:

coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
func_tags = _pp.ZeroOrMore(_pp.Group(tag_sep + func_word+coord_tag))

You might also add a default value for coord_tag if none is supplied,
to simplify your parse action?

coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word),None)

Now the coords and funcs will be kept together.

3. Of course, you are correct in using Combine to ensure that you only
accept adjacent characters.  But you only need to use it at the
outermost level.

4. You can use several dict-like operations directly on a ParseResults
object, such as keys(), items(), values(), and the "in" operator.  Also,
the [] notation and the .attribute notation are nearly identical, except
that a [] reference to a missing element raises a KeyError, whereas
.attribute always returns something.  For instance, in your example, the
get_tag() parse action uses dict.pop() to extract the 'coord' field.  If
coord is present, you could retrieve it using "tokens['coord']" or
"tokens.coord".  If coord is missing, "tokens['coord']" will raise a
KeyError, but tokens.coord will return an empty string.  If you need to
"listify" a ParseResults, try calling asList().

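The difference between the two access styles in point 4 can be sketched as follows. The tiny grammar is hypothetical (a tag followed by an optional "=number" coord) and is only meant to produce a ParseResults with one named field present and one missing.

```python
import pyparsing as pp

# Hypothetical mini-grammar: a tag with an optional "=<digits>" coord.
tag = pp.Word(pp.alphas).setResultsName("tag")
coord = pp.Optional(
    pp.Literal("=").suppress() + pp.Word(pp.nums).setResultsName("coord")
)
grammar = tag + coord

present = grammar.parseString("NP=2")
missing = grammar.parseString("NP")

print(sorted(present.keys()))        # ['coord', 'tag']
print(present["coord"])              # 2
print("coord" in missing)            # False
print(repr(missing.coord))           # ''  (attribute access never raises)

try:
    missing["coord"]                 # item access raises on a missing name
except KeyError:
    print("KeyError for missing['coord']")

print(missing.asList())              # ['NP']
```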

It's not clear to me what if any further help you are looking for, now
that your initial question (about StringEnd()) has been answered.  But
please let us know how things work out.

-- Paul



Re: make sure entire string was parsed

2005-09-12 Thread Steven Bethard
Steven Bethard wrote:
> Paul McGuire wrote:
> 
>>> I have to differentiate between:
>>>  (NP -x-y)
>>> and:
>>>  (NP-x -y)
>>> I'm doing this now using Combine.  Does that seem right?
>>
>>
>> If your word char set is just alphanums+"-", then this will work
>> without doing anything unnatural with leaveWhitespace:
>>
>> from pyparsing import *
>>
>> thing = Word(alphanums+"-")
>> LPAREN = Literal("(").suppress()
>> RPAREN = Literal(")").suppress()
>> node = LPAREN + OneOrMore(thing) + RPAREN
>>
>> print node.parseString("(NP -x-y)")
>> print node.parseString("(NP-x -y)")
>>
>> will print:
>>
>> ['NP', '-x-y']
>> ['NP-x', '-y']
> 
> 
> I actually need to break these into:
> 
> ['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
> ['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

Oops, sorry, the last line should have been:

['NP', 'x', '-y'] {'tag':'NP', 'functions':['x'], 'word':'-y'}

Sorry to introduce confusion into an already confusing parsing problem. ;)

STeVe


Re: make sure entire string was parsed

2005-09-12 Thread Steven Bethard
Paul McGuire wrote:
>>>I have to differentiate between:
>>>  (NP -x-y)
>>>and:
>>>  (NP-x -y)
>>>I'm doing this now using Combine.  Does that seem right?
> 
> If your word char set is just alphanums+"-", then this will work
> without doing anything unnatural with leaveWhitespace:
> 
> from pyparsing import *
> 
> thing = Word(alphanums+"-")
> LPAREN = Literal("(").suppress()
> RPAREN = Literal(")").suppress()
> node = LPAREN + OneOrMore(thing) + RPAREN
> 
> print node.parseString("(NP -x-y)")
> print node.parseString("(NP-x -y)")
> 
> will print:
> 
> ['NP', '-x-y']
> ['NP-x', '-y']

I actually need to break these into:

['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

I know the dict syntax afterwards isn't quite what pyparsing would 
output, but hopefully my intent is clear.  I need to use the dict-style 
results from setResultsName() calls because in the full grammar, I have 
a lot of optional elements.  For example:

(NP-1 -a)
   --> {'tag':'NP', 'id':'1', 'word':'-a'}
(NP-x-2 -B)
   --> {'tag':'NP', 'functions':['x'], 'id':'2', 'word':'-B'}
(NP-x-y=2-3 -4)
   --> {'tag':'NP', 'functions':['x', 'y'], 'coord':'2', 'id':'3', 
'word':'-4'}
(-NONE- x)
   --> {'tag':None, 'word':'x'}
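
To make the intended decomposition of the tag string concrete, here is a rough sketch that uses the standard re module instead of pyparsing. It is not the grammar from the draft: the function name split_tags and the character classes are assumptions, and it handles only the tag part, not the word that follows it.

```python
import re

# Assumed character classes: tags may not contain "-", "=", parentheses or
# whitespace; function names additionally may not contain digits.
TAGS = re.compile(
    r"(?P<tag>[^-=()\s]+)"
    r"(?P<funcs>(?:-[^-=()\s\d]+)*)"
    r"(?:=(?P<coord>\d+))?"
    r"(?:-(?P<id>\d+))?"
)
SPECIAL = re.compile(r"-[^-=()\s]+-")


def split_tags(text):
    """Split a tag string such as 'NP-x-y=2-3' into its optional parts."""
    if SPECIAL.fullmatch(text):
        # Special tags like -NONE- are kept whole; -NONE- itself maps to None.
        return {"tag": None if text == "-NONE-" else text}
    match = TAGS.fullmatch(text)
    if match is None:
        raise ValueError("not a valid tag string: %r" % text)
    result = {"tag": match.group("tag")}
    if match.group("funcs"):
        result["functions"] = match.group("funcs")[1:].split("-")
    for key in ("coord", "id"):
        if match.group(key) is not None:
            result[key] = match.group(key)
    return result


print(split_tags("NP-1"))
print(split_tags("NP-x-2"))
print(split_tags("NP-x-y=2-3"))
print(split_tags("-NONE-"))
```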



STeVe

P.S.  In case you're curious, here's my current draft of the code:

# some character classes
printables_trans = _pp.printables.translate
word_chars = printables_trans(_id_trans, '()')
word_elem = _pp.Word(word_chars)
syn_chars = printables_trans(_id_trans, '()-=')
syn_word = _pp.Word(syn_chars)
func_chars = printables_trans(_id_trans, '()-=0123456789')
func_word = _pp.Word(func_chars)
num_word = _pp.Word(_pp.nums)

# tag separators
dash = _pp.Literal('-')
tag_sep = dash.suppress()
coord_sep = _pp.Literal('=').suppress()

# tag types (use Combine to guarantee no spaces)
special_tag = _pp.Combine(dash + syn_word + dash)
syn_tag = syn_word
func_tags = _pp.ZeroOrMore(_pp.Combine(tag_sep + func_word))
coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
id_tag = _pp.Optional(_pp.Combine(tag_sep + num_word))

# give tag types result names
special_tag = special_tag.setResultsName('tag')
syn_tag = syn_tag.setResultsName('tag')
func_tags = func_tags.setResultsName('funcs')
coord_tag = coord_tag.setResultsName('coord')
id_tag = id_tag.setResultsName('id')

# combine tag types into a tags element
normal_tags = syn_tag + func_tags + coord_tag + id_tag
tags = special_tag | _pp.Combine(normal_tags)
def get_tag(orig_string, tokens_start, tokens):
    tokens = dict(tokens)
    tag = tokens.pop('tag')
    if tag == '-NONE-':
        tag = None
    functions = list(tokens.pop('funcs', []))
    coord = tokens.pop('coord', None)
    id = tokens.pop('id', None)
    return [dict(tag=tag, functions=functions,
                 coord=coord, id=id)]
tags.setParseAction(get_tag)

# node parentheses
start = _pp.Literal('(').suppress()
end = _pp.Literal(')').suppress()

# words
word = word_elem.setResultsName('word')

# leaf nodes
leaf_node = tags + _pp.Optional(word)
def get_leaf_node(orig_string, tokens_start, tokens):
    try:
        tag_dict, word = tokens
        word = cls._unescape(word)
    except ValueError:
        tag_dict, = tokens
        word = None
    return cls(word=word, **tag_dict)
leaf_node.setParseAction(get_leaf_node)

# node, recursive
node = _pp.Forward()

# branch nodes
branch_node = tags + _pp.OneOrMore(node)
def get_branch_node(orig_string, tokens_start, tokens):
    return cls(children=tokens[1:], **tokens[0])
branch_node.setParseAction(get_branch_node)

# node, recursive
node << start + (branch_node | leaf_node) + end

# root node may have additional parentheses
root_node = node | start + node + end
root_nodes = _pp.OneOrMore(root_node)

# make sure nodes start and end string
str_start = _pp.StringStart()
str_end = _pp.StringEnd()
cls._root_node = str_start + root_node + str_end
cls._root_nodes = str_start + root_nodes + str_end


Re: make sure entire string was parsed

2005-09-11 Thread Paul McGuire
Steve -

>>I have to differentiate between:
>>   (NP -x-y)
>>and:
>>   (NP-x -y)
>>I'm doing this now using Combine.  Does that seem right?

If your word char set is just alphanums+"-", then this will work
without doing anything unnatural with leaveWhitespace:

from pyparsing import *

thing = Word(alphanums+"-")
LPAREN = Literal("(").suppress()
RPAREN = Literal(")").suppress()
node = LPAREN + OneOrMore(thing) + RPAREN

print node.parseString("(NP -x-y)")
print node.parseString("(NP-x -y)")

will print:

['NP', '-x-y']
['NP-x', '-y']


Your examples helped me to see what my operator precedence concern was.
Fortunately, your usage was an And, composed using '+' operators, which
bind more tightly than '<<'.  If your construct had been a MatchFirst,
composed using '|' operators (which bind less tightly than '<<'), things
would not be so pretty:

print 2 << 1 | 3
print 2 << (1 | 3)

7
16

So I've just gotten into the habit of parenthesizing anything I load
into a Forward using '<<'.
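
The same precedence rule can be checked on symbolic expressions rather than integers. This sketch uses the standard ast module to report which operator Python applies last; the names node, a and b are placeholders and are never evaluated.

```python
import ast


def top_operator(source):
    """Return the name of the operator applied last in a binary expression."""
    tree = ast.parse(source, mode="eval")
    return type(tree.body.op).__name__


# '+' binds tighter than '<<', so the whole sum is loaded into node.
print(top_operator("node << a + b"))     # LShift

# '|' binds looser than '<<', so this is really (node << a) | b.
print(top_operator("node << a | b"))     # BitOr

# Parentheses restore the intended grouping.
print(top_operator("node << (a | b)"))   # LShift
```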

-- Paul



Re: make sure entire string was parsed

2005-09-11 Thread Steven Bethard
Paul McGuire wrote:
> Thanks for giving pyparsing a try!  To see whether your input text
> consumes the whole string, add a StringEnd() element to the end of your
> BNF.  Then if there is more text after the parsed text, parseString
> will throw a ParseException.

Thanks, that's exactly what I was looking for.

> I notice you call leaveWhitespace on several of your parse elements, so
> you may have to rstrip() the input text before calling parseString.  I
> am curious whether leaveWhitespace is really necessary for your
> grammar.  If it is, you can usually just call leaveWhitespace on the
> root element, and this will propagate to all the sub elements.

Yeah, sorry, I was still messing around with that part of the code.  My 
problem is that I have to differentiate between:

   (NP -x-y)

and:

   (NP-x -y)

I'm doing this now using Combine.  Does that seem right?
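
For reference, a minimal sketch of what Combine changes. The grammar is a simplified assumption (letters, a dash, letters), not the one from the posted code: without Combine, whitespace between the pieces is skipped, while Combine requires them to be adjacent and joins them into one token.

```python
import pyparsing as pp

word = pp.Word(pp.alphas)
dash = pp.Literal("-")

loose = word + dash + word               # whitespace between tokens is skipped
tight = pp.Combine(word + dash + word)   # tokens must be adjacent

print(loose.parseString("NP -x").asList())   # ['NP', '-', 'x']
print(tight.parseString("NP-x").asList())    # ['NP-x']

try:
    tight.parseString("NP -x")               # the space breaks adjacency
except pp.ParseException as err:
    print("not adjacent:", err)
```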

> Lastly, you may get caught up with operator precedence, I think your
> node assignment statement may need to change from
> node << start + (branch_node | leaf_node) + end
> to
> node << (start + (branch_node | leaf_node) + end)

I think I'm okay:

py> 2 << 1 + 2
16
py> (2 << 1) + 2
6
py> 2 << (1 + 2)
16

Thanks for the help!

STeVe


Re: make sure entire string was parsed

2005-09-10 Thread Paul McGuire
Steven -

Thanks for giving pyparsing a try!  To see whether your input text
consumes the whole string, add a StringEnd() element to the end of your
BNF.  Then if there is more text after the parsed text, parseString
will throw a ParseException.
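
A minimal sketch of that suggestion, using the simplified number grammar from the original question (the name strict_number is made up for this example):

```python
import pyparsing as pp

number = pp.Word(pp.nums)
strict_number = number + pp.StringEnd()

# Without StringEnd, the trailing 'abc' is silently left unparsed.
print(number.parseString("121abc").asList())   # ['121']

# With StringEnd, leftover text makes the parse fail.
try:
    strict_number.parseString("121abc")
except pp.ParseException as err:
    print("rejected:", err)

print(strict_number.parseString("121").asList())   # ['121']
```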

I notice you call leaveWhitespace on several of your parse elements, so
you may have to rstrip() the input text before calling parseString.  I
am curious whether leaveWhitespace is really necessary for your
grammar.  If it is, you can usually just call leaveWhitespace on the
root element, and this will propagate to all the sub elements.
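
The interaction between leaveWhitespace, StringEnd and trailing whitespace can be sketched like this. The grammar is again the simplified number example, and the helper accepts() is hypothetical; it only reports whether a parse succeeds.

```python
import pyparsing as pp

number = pp.Word(pp.nums)

# Default behaviour: whitespace before each element, including StringEnd,
# is skipped.
relaxed = number + pp.StringEnd()

# leaveWhitespace on the root propagates to the sub-elements, so trailing
# whitespace is no longer skipped before StringEnd.
strict = (number + pp.StringEnd()).leaveWhitespace()


def accepts(grammar, text):
    """Return True if grammar parses text without raising ParseException."""
    try:
        grammar.parseString(text)
        return True
    except pp.ParseException:
        return False


print(accepts(relaxed, "121 \n"))          # True
print(accepts(strict, "121 \n"))           # False
print(accepts(strict, "121 \n".rstrip()))  # True
```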

Lastly, you may get caught up with operator precedence, I think your
node assignment statement may need to change from
node << start + (branch_node | leaf_node) + end
to
node << (start + (branch_node | leaf_node) + end)

HTH, 
-- Paul



[pyparsing] make sure entire string was parsed

2005-09-10 Thread Steven Bethard
How do I make sure that my entire string was parsed when I call a 
pyparsing element's parseString method?  Here's a dramatically 
simplified version of my problem:

py> import pyparsing as pp
py> match = pp.Word(pp.nums)
py> def parse_num(s, loc, toks):
...     n, = toks
...     return int(n) + 10
...
py> match.setParseAction(parse_num)
W:(0123...)
py> match.parseString('121abc')
([131], {})

I want to know (somehow) that when I called match.parseString(), there 
was some of the string left over (in this case, 'abc') after the parse 
was complete.  How can I do this?  (I don't think I can do character 
counting; all my internal setParseAction() functions return non-strings).

STeVe

P.S.  FWIW, I've included the real code below.  I need to throw an 
exception when I call the parseString method of cls._root_node or 
cls._root_nodes and the entire string is not consumed.

--
# some character classes
printables_trans = _pp.printables.translate
word_chars = printables_trans(_id_trans, '()')
syn_tag_chars = printables_trans(_id_trans, '()-=')
func_tag_chars = printables_trans(_id_trans, '()-=0123456789')

# basic tag components
sep = _pp.Literal('-').leaveWhitespace()
alt_sep = _pp.Literal('=').leaveWhitespace()
special_word = _pp.Combine(sep + _pp.Word(syn_tag_chars) + sep)
supp_sep = (alt_sep | sep).suppress()
syn_word = _pp.Word(syn_tag_chars).leaveWhitespace()
func_word = _pp.Word(func_tag_chars).leaveWhitespace()
id_word = _pp.Word(_pp.nums).leaveWhitespace()

# the different tag types
special_tag = special_word.setResultsName('tag')
syn_tag = syn_word.setResultsName('tag')
func_tags = _pp.ZeroOrMore(supp_sep + func_word)
func_tags = func_tags.setResultsName('funcs')
id_tag = _pp.Optional(supp_sep + id_word).setResultsName('id')
tags = special_tag | (syn_tag + func_tags + id_tag)
def get_tag(orig_string, tokens_start, tokens):
    tokens = dict(tokens)
    tag = tokens.pop('tag')
    if tag == '-NONE-':
        tag = None
    functions = list(tokens.pop('funcs', []))
    id = tokens.pop('id', None)
    return [dict(tag=tag, functions=functions, id=id)]
tags.setParseAction(get_tag)

# node parentheses
start = _pp.Literal('(').suppress()
end = _pp.Literal(')').suppress()

# words
word = _pp.Word(word_chars).setResultsName('word')

# leaf nodes
leaf_node = tags + _pp.Optional(word)
def get_leaf_node(orig_string, tokens_start, tokens):
    try:
        tag_dict, word = tokens
        word = cls._unescape(word)
    except ValueError:
        tag_dict, = tokens
        word = None
    return cls(word=word, **tag_dict)
leaf_node.setParseAction(get_leaf_node)

# node, recursive
node = _pp.Forward()

# branch nodes
branch_node = tags + _pp.OneOrMore(node)
def get_branch_node(orig_string, tokens_start, tokens):
    return cls(children=tokens[1:], **tokens[0])
branch_node.setParseAction(get_branch_node)

# node, recursive
node << start + (branch_node | leaf_node) + end

# root node may have additional parentheses
cls._root_node = node | start + node + end
cls._root_nodes = _pp.OneOrMore(cls._root_node)