Re: make sure entire string was parsed
Paul McGuire wrote:
> I still don't know the BNF you are working from

Just to satisfy any curiosity you might have, it's the Penn TreeBank format:

    http://www.cis.upenn.edu/~treebank/

(Except that the actual Penn Treebank data unfortunately differs from the format spec in a few ways.)

> 1. I'm surprised func_word does not permit numbers anywhere in the
> body. Is this just a feature you have not implemented yet? As long as
> func_word does not start with a digit, you can still define one
> unambiguously to allow numbers after the first character if you define
> func_word as
>
>     func_word = _pp.Word(func_chars,func_chars+_pp.nums)

Ahh, very nice. The spec's vague, but this is probably what I want to do.

> 2. Is coord an optional sub-element of a func?

No, functions, coord and id are optional sub-elements of the tags string.

> You might also add a default value for coord_tag if none is supplied,
> to simplify your parse action?

Oh, that's nice. I missed that functionality.

> It's not clear to me what if any further help you are looking for, now
> that your initial question (about StringEnd()) has been answered.

Yes, thanks, you definitely answered the initial question. And your follow-up commentary was also very helpful. Thanks again!

STeVe
--
http://mail.python.org/mailman/listinfo/python-list
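For readers following along, the two-argument Word form quoted above can be tried in isolation. This is only a sketch: func_chars here is assumed to be the plain ASCII letters (the real character set in the thread is built from printables), and letters_only is a made-up name for the contrasting one-argument version.

```python
import pyparsing as pp

# Assumption for illustration: function tags are built from ASCII letters.
func_chars = pp.alphas

# One-argument form: every character must come from func_chars.
letters_only = pp.Word(func_chars)

# Two-argument form: the first character comes from func_chars,
# later characters may also be digits.
func_word = pp.Word(func_chars, func_chars + pp.nums)

short_match = letters_only.parseString('SBJ2').asList()  # stops before the digit
full_match = func_word.parseString('SBJ2').asList()      # digit is part of the word

def starts_ok(text):
    """Return True if func_word can match at the start of `text`."""
    try:
        func_word.parseString(text)
        return True
    except pp.ParseException:
        return False
```

Because the first character still cannot be a digit, a purely numeric id such as "2" is never mistaken for a function tag.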
Re: make sure entire string was parsed
Steve -

Wow, this is a pretty dense pyparsing program. You are really pushing the envelope in your use of ParseResults, dicts, etc., but pretty much everything seems to be working.

I still don't know the BNF you are working from, but here are some other "shots in the dark":

1. I'm surprised func_word does not permit numbers anywhere in the body. Is this just a feature you have not implemented yet? As long as func_word does not start with a digit, you can still define one unambiguously to allow numbers after the first character if you define func_word as

    func_word = _pp.Word(func_chars,func_chars+_pp.nums)

Perhaps similar for syn_word as well.

2. Is coord an optional sub-element of a func? If so, you might want to group them so that they stay together, something like:

    coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
    func_tags = _pp.ZeroOrMore(_pp.Group(tag_sep + func_word+coord_tag))

You might also add a default value for coord_tag if none is supplied, to simplify your parse action?

    coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word),None)

Now the coords and funcs will be kept together.

3. Of course, you are correct in using Combine to ensure that you only accept adjacent characters. But you only need to use it at the outermost level.

4. You can use several dict-like functions directly on a ParseResults object, such as keys(), items(), values(), in, etc. Also, the [] notation and the .attribute notation are nearly identical, except that [] refs on a missing element will raise a KeyError, while .attribute will always return something. For instance, in your example, the getTag() parse action uses dict.pop() to extract the 'coord' field. If coord is present, you could retrieve it using "tokens['coord']" or "tokens.coord". If coord is missing, "tokens['coord']" will raise a KeyError, but tokens.coord will return an empty string. If you need to "listify" a ParseResults, try calling asList().

It's not clear to me what if any further help you are looking for, now that your initial question (about StringEnd()) has been answered. But please let us know how things work out.

-- Paul
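Point 4 above can be checked with a small grammar. This is a rough sketch rather than the grammar from the thread: the tag expression is reduced to letters plus an optional "=digits" coord, and the helper name index_raises_keyerror is invented for the example.

```python
import pyparsing as pp

# Reduced, hypothetical grammar: a tag of letters plus an optional '=<digits>' coord.
coord_sep = pp.Literal('=').suppress()
num_word = pp.Word(pp.nums)
coord_tag = pp.Optional(pp.Combine(coord_sep + num_word)).setResultsName('coord')
tag = pp.Word(pp.alphas).setResultsName('tag') + coord_tag

with_coord = tag.parseString('NP=2')
without_coord = tag.parseString('NP')

def index_raises_keyerror(results, name):
    """Return True if results[name] raises KeyError."""
    try:
        results[name]
        return False
    except KeyError:
        return True

# Both notations agree when the named element is present.
present_by_index = with_coord['coord']
present_by_attr = with_coord.coord

# When it is absent, attribute access falls back to an empty string,
# while [] indexing raises KeyError.
missing_by_attr = without_coord.coord
missing_raises = index_raises_keyerror(without_coord, 'coord')
```

In a parse action this means tokens.coord can be tested directly, without the dict() copy and pop() with a default.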
Re: make sure entire string was parsed
Steven Bethard wrote:
> Paul McGuire wrote:
>>> I have to differentiate between:
>>>     (NP -x-y)
>>> and:
>>>     (NP-x -y)
>>> I'm doing this now using Combine. Does that seem right?
>>
>> If your word char set is just alphanums+"-", then this will work
>> without doing anything unnatural with leaveWhitespace:
>>
>>     from pyparsing import *
>>
>>     thing = Word(alphanums+"-")
>>     LPAREN = Literal("(").suppress()
>>     RPAREN = Literal(")").suppress()
>>     node = LPAREN + OneOrMore(thing) + RPAREN
>>
>>     print node.parseString("(NP -x-y)")
>>     print node.parseString("(NP-x -y)")
>>
>> will print:
>>
>>     ['NP', '-x-y']
>>     ['NP-x', '-y']
>
> I actually need to break these into:
>
>     ['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
>     ['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

Oops, sorry, the last line should have been:

    ['NP', 'x', '-y'] {'tag':'NP', 'functions':['x'], 'word':'-y'}

Sorry to introduce confusion into an already confusing parsing problem. ;)

STeVe
Re: make sure entire string was parsed
Paul McGuire wrote:
>>> I have to differentiate between:
>>>     (NP -x-y)
>>> and:
>>>     (NP-x -y)
>>> I'm doing this now using Combine. Does that seem right?
>
> If your word char set is just alphanums+"-", then this will work
> without doing anything unnatural with leaveWhitespace:
>
>     from pyparsing import *
>
>     thing = Word(alphanums+"-")
>     LPAREN = Literal("(").suppress()
>     RPAREN = Literal(")").suppress()
>     node = LPAREN + OneOrMore(thing) + RPAREN
>
>     print node.parseString("(NP -x-y)")
>     print node.parseString("(NP-x -y)")
>
> will print:
>
>     ['NP', '-x-y']
>     ['NP-x', '-y']

I actually need to break these into:

    ['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
    ['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

I know the dict syntax afterwards isn't quite what pyparsing would output, but hopefully my intent is clear. I need to use the dict-style results from setResultsName() calls because in the full grammar, I have a lot of optional elements. For example:

    (NP-1 -a)       --> {'tag':'NP', 'id':'1', 'word':'-a'}
    (NP-x-2 -B)     --> {'tag':'NP', 'functions':['x'], 'id':'2', 'word':'-B'}
    (NP-x-y=2-3 -4) --> {'tag':'NP', 'functions':['x', 'y'], 'coord':'2', 'id':'3', 'word':'-4'}
    (-NONE- x)      --> {'tag':None, 'word':'x'}

STeVe

P.S. In case you're curious, here's my current draft of the code:

    # some character classes
    printables_trans = _pp.printables.translate
    word_chars = printables_trans(_id_trans, '()')
    word_elem = _pp.Word(word_chars)
    syn_chars = printables_trans(_id_trans, '()-=')
    syn_word = _pp.Word(syn_chars)
    func_chars = printables_trans(_id_trans, '()-=0123456789')
    func_word = _pp.Word(func_chars)
    num_word = _pp.Word(_pp.nums)

    # tag separators
    dash = _pp.Literal('-')
    tag_sep = dash.suppress()
    coord_sep = _pp.Literal('=').suppress()

    # tag types (use Combine to guarantee no spaces)
    special_tag = _pp.Combine(dash + syn_word + dash)
    syn_tag = syn_word
    func_tags = _pp.ZeroOrMore(_pp.Combine(tag_sep + func_word))
    coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
    id_tag = _pp.Optional(_pp.Combine(tag_sep + num_word))

    # give tag types result names
    special_tag = special_tag.setResultsName('tag')
    syn_tag = syn_tag.setResultsName('tag')
    func_tags = func_tags.setResultsName('funcs')
    coord_tag = coord_tag.setResultsName('coord')
    id_tag = id_tag.setResultsName('id')

    # combine tag types into a tags element
    normal_tags = syn_tag + func_tags + coord_tag + id_tag
    tags = special_tag | _pp.Combine(normal_tags)
    def get_tag(orig_string, tokens_start, tokens):
        tokens = dict(tokens)
        tag = tokens.pop('tag')
        if tag == '-NONE-':
            tag = None
        functions = list(tokens.pop('funcs', []))
        coord = tokens.pop('coord', None)
        id = tokens.pop('id', None)
        return [dict(tag=tag, functions=functions, coord=coord, id=id)]
    tags.setParseAction(get_tag)

    # node parentheses
    start = _pp.Literal('(').suppress()
    end = _pp.Literal(')').suppress()

    # words
    word = word_elem.setResultsName('word')

    # leaf nodes
    leaf_node = tags + _pp.Optional(word)
    def get_leaf_node(orig_string, tokens_start, tokens):
        try:
            tag_dict, word = tokens
            word = cls._unescape(word)
        except ValueError:
            tag_dict, = tokens
            word = None
        return cls(word=word, **tag_dict)
    leaf_node.setParseAction(get_leaf_node)

    # node, recursive
    node = _pp.Forward()

    # branch nodes
    branch_node = tags + _pp.OneOrMore(node)
    def get_branch_node(orig_string, tokens_start, tokens):
        return cls(children=tokens[1:], **tokens[0])
    branch_node.setParseAction(get_branch_node)

    # node, recursive
    node << start + (branch_node | leaf_node) + end

    # root node may have additional parentheses
    root_node = node | start + node + end
    root_nodes = _pp.OneOrMore(root_node)

    # make sure nodes start and end string
    str_start = _pp.StringStart()
    str_end = _pp.StringEnd()
    cls._root_node = str_start + root_node + str_end
    cls._root_nodes = str_start + root_nodes + str_end
Re: make sure entire string was parsed
Steve -

>> I have to differentiate between:
>>     (NP -x-y)
>> and:
>>     (NP-x -y)
>> I'm doing this now using Combine. Does that seem right?

If your word char set is just alphanums+"-", then this will work without doing anything unnatural with leaveWhitespace:

    from pyparsing import *

    thing = Word(alphanums+"-")
    LPAREN = Literal("(").suppress()
    RPAREN = Literal(")").suppress()
    node = LPAREN + OneOrMore(thing) + RPAREN

    print node.parseString("(NP -x-y)")
    print node.parseString("(NP-x -y)")

will print:

    ['NP', '-x-y']
    ['NP-x', '-y']

Your examples helped me to see what my operator precedence concern was. Fortunately, your usage was an And, composed using '+' operators, and '+' binds more tightly than '<<'. If your construct were a MatchFirst, composed using '|' operators, things wouldn't be so pretty, because '<<' binds more tightly than '|':

    print 2 << 1 | 3
    print 2 << (1 | 3)

    7
    16

So I've just gotten into the habit of parenthesizing anything I load into a Forward using '<<'.

-- Paul
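The precedence point can be made visible without pyparsing at all. The sketch below uses a toy Expr class (invented for this note, not part of pyparsing) whose operators simply record how Python grouped them, so the difference between '+' and '|' on the right-hand side of '<<' shows up as text.

```python
class Expr:
    """Toy stand-in for a parser element; it only records operator grouping."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return Expr('(%s + %s)' % (self.text, other.text))

    def __or__(self, other):
        return Expr('(%s | %s)' % (self.text, other.text))

    def __lshift__(self, other):
        return Expr('(%s << %s)' % (self.text, other.text))


node, a, b = Expr('node'), Expr('a'), Expr('b')

# '+' binds tighter than '<<', so the whole sum is loaded into node.
plus_grouping = (node << a + b).text

# '<<' binds tighter than '|', so only `a` is loaded into node.
pipe_grouping = (node << a | b).text

# Explicit parentheses give the intended grouping regardless of operator.
safe_grouping = (node << (a | b)).text

# The same rule with plain integers, as in the message above.
int_unparenthesized = 2 << 1 | 3
int_parenthesized = 2 << (1 | 3)
```

Here plus_grouping comes out as '(node << (a + b))' while pipe_grouping is '((node << a) | b)', which is why parenthesizing the right-hand side of '<<' is the safer habit.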
Re: make sure entire string was parsed
Paul McGuire wrote:
> Thanks for giving pyparsing a try! To see whether your input text
> consumes the whole string, add a StringEnd() element to the end of your
> BNF. Then if there is more text after the parsed text, parseString
> will throw a ParseException.

Thanks, that's exactly what I was looking for.

> I notice you call leaveWhitespace on several of your parse elements, so
> you may have to rstrip() the input text before calling parseString. I
> am curious whether leaveWhitespace is really necessary for your
> grammar. If it is, you can usually just call leaveWhitespace on the
> root element, and this will propagate to all the sub elements.

Yeah, sorry, I was still messing around with that part of the code. My problem is that I have to differentiate between:

    (NP -x-y)

and:

    (NP-x -y)

I'm doing this now using Combine. Does that seem right?

> Lastly, you may get caught up with operator precedence, I think your
> node assignment statement may need to change from
>     node << start + (branch_node | leaf_node) + end
> to
>     node << (start + (branch_node | leaf_node) + end)

I think I'm okay:

    py> 2 << 1 + 2
    16
    py> (2 << 1) + 2
    6
    py> 2 << (1 + 2)
    16

Thanks for the help!

STeVe
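To illustrate what Combine contributes here, the following is a minimal sketch with a made-up two-part tag (letters, a dash, letters) rather than the full tag grammar from the thread. The names loose_tag, tight_tag and matches are assumptions for the example.

```python
import pyparsing as pp

# Hypothetical two-part tag: letters, a dash, then letters.
parts = pp.Word(pp.alphas) + pp.Literal('-') + pp.Word(pp.alphas)

# Plain And: whitespace between the pieces is skipped by default.
loose_tag = parts

# Combine: pieces must be adjacent, and the result is a single string.
tight_tag = pp.Combine(parts)

def matches(expr, text):
    """Return True if `expr` matches at the start of `text`."""
    try:
        expr.parseString(text)
        return True
    except pp.ParseException:
        return False

loose_tokens = loose_tag.parseString('NP -x').asList()
tight_tokens = tight_tag.parseString('NP-x').asList()
tight_accepts_space = matches(tight_tag, 'NP -x')
```

With the space present, only the uncombined expression matches, which is the distinction needed between "(NP -x-y)" and "(NP-x -y)".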
Re: make sure entire string was parsed
Steven -

Thanks for giving pyparsing a try! To see whether your input text consumes the whole string, add a StringEnd() element to the end of your BNF. Then if there is more text after the parsed text, parseString will throw a ParseException.

I notice you call leaveWhitespace on several of your parse elements, so you may have to rstrip() the input text before calling parseString. I am curious whether leaveWhitespace is really necessary for your grammar. If it is, you can usually just call leaveWhitespace on the root element, and this will propagate to all the sub elements.

Lastly, you may get caught up with operator precedence. I think your node assignment statement may need to change from

    node << start + (branch_node | leaf_node) + end

to

    node << (start + (branch_node | leaf_node) + end)

HTH,
-- Paul
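The StringEnd() suggestion can be sketched against the simplified example from the original post. The names number, strict and the two helper functions are made up for illustration. The second argument to parseString (parseAll) is an assumption about the installed version: it exists in later pyparsing releases and is not something mentioned in this thread.

```python
import pyparsing as pp

# Simplified grammar from the original post: digits converted by a parse action.
number = pp.Word(pp.nums)
number.setParseAction(lambda s, loc, toks: int(toks[0]) + 10)

# Without StringEnd(), the trailing 'abc' is silently left unparsed.
loose_result = number.parseString('121abc').asList()

# Appending StringEnd() turns leftover text into a ParseException.
strict = number + pp.StringEnd()

def parses_fully(text):
    """Return True if `strict` consumes the whole of `text`."""
    try:
        strict.parseString(text)
        return True
    except pp.ParseException:
        return False

def parses_fully_with_flag(text):
    """Same check using parseString's second argument (parseAll)."""
    try:
        number.parseString(text, True)
        return True
    except pp.ParseException:
        return False
```

Either form avoids character counting, so it works even when parse actions return non-string objects.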
[pyparsing] make sure entire string was parsed
How do I make sure that my entire string was parsed when I call a pyparsing element's parseString method? Here's a dramatically simplified version of my problem:

    py> import pyparsing as pp
    py> match = pp.Word(pp.nums)
    py> def parse_num(s, loc, toks):
    ...     n, = toks
    ...     return int(n) + 10
    ...
    py> match.setParseAction(parse_num)
    W:(0123...)
    py> match.parseString('121abc')
    ([131], {})

I want to know (somehow) that when I called match.parseString(), there was some of the string left over (in this case, 'abc') after the parse was complete. How can I do this? (I don't think I can do character counting; all my internal setParseAction() functions return non-strings).

STeVe

P.S. FWIW, I've included the real code below. I need to throw an exception when I call the parseString method of cls._root_node or cls._root_nodes and the entire string is not consumed.

--

    # some character classes
    printables_trans = _pp.printables.translate
    word_chars = printables_trans(_id_trans, '()')
    syn_tag_chars = printables_trans(_id_trans, '()-=')
    func_tag_chars = printables_trans(_id_trans, '()-=0123456789')

    # basic tag components
    sep = _pp.Literal('-').leaveWhitespace()
    alt_sep = _pp.Literal('=').leaveWhitespace()
    special_word = _pp.Combine(sep + _pp.Word(syn_tag_chars) + sep)
    supp_sep = (alt_sep | sep).suppress()
    syn_word = _pp.Word(syn_tag_chars).leaveWhitespace()
    func_word = _pp.Word(func_tag_chars).leaveWhitespace()
    id_word = _pp.Word(_pp.nums).leaveWhitespace()

    # the different tag types
    special_tag = special_word.setResultsName('tag')
    syn_tag = syn_word.setResultsName('tag')
    func_tags = _pp.ZeroOrMore(supp_sep + func_word)
    func_tags = func_tags.setResultsName('funcs')
    id_tag = _pp.Optional(supp_sep + id_word).setResultsName('id')
    tags = special_tag | (syn_tag + func_tags + id_tag)
    def get_tag(orig_string, tokens_start, tokens):
        tokens = dict(tokens)
        tag = tokens.pop('tag')
        if tag == '-NONE-':
            tag = None
        functions = list(tokens.pop('funcs', []))
        id = tokens.pop('id', None)
        return [dict(tag=tag, functions=functions, id=id)]
    tags.setParseAction(get_tag)

    # node parentheses
    start = _pp.Literal('(').suppress()
    end = _pp.Literal(')').suppress()

    # words
    word = _pp.Word(word_chars).setResultsName('word')

    # leaf nodes
    leaf_node = tags + _pp.Optional(word)
    def get_leaf_node(orig_string, tokens_start, tokens):
        try:
            tag_dict, word = tokens
            word = cls._unescape(word)
        except ValueError:
            tag_dict, = tokens
            word = None
        return cls(word=word, **tag_dict)
    leaf_node.setParseAction(get_leaf_node)

    # node, recursive
    node = _pp.Forward()

    # branch nodes
    branch_node = tags + _pp.OneOrMore(node)
    def get_branch_node(orig_string, tokens_start, tokens):
        return cls(children=tokens[1:], **tokens[0])
    branch_node.setParseAction(get_branch_node)

    # node, recursive
    node << start + (branch_node | leaf_node) + end

    # root node may have additional parentheses
    cls._root_node = node | start + node + end
    cls._root_nodes = _pp.OneOrMore(cls._root_node)