Re: regular expression extracting groups
Thanks all for your responses, especially Paul McGuire for the excellent example usage of pyparsing. I'm off to check out pyparsing. Thanks, Chris -- http://mail.python.org/mailman/listinfo/python-list
Re: regular expression extracting groups
On Aug 10, 7:56 am, Paul Hankin <[EMAIL PROTECTED]> wrote:
> On Aug 10, 2:30 pm, [EMAIL PROTECTED] wrote:
>
> > I'm trying to use regular expressions to help me quickly extract the
> > contents of messages that my application will receive.
>
> Don't use regexps for parsing complex data; they're limited,
> completely unreadable, and hugely difficult to debug. Your code is
> well written, and you've already reached the limits of the power of
> regexps, and it's difficult to read.
>
> Have a look at pyparsing for a simple solution to your
> problem. http://pyparsing.wikispaces.com/
>
> --
> Paul Hankin

Well, predictably, the pyparsing solution is simple UNTIL we get to the
"multidict" options field. Pyparsing has a Dict construct that has the
same limitations as Python's dict - only the last key-value would be
retained. So I had to write a parse action to manually stitch the
key-value groups into the parsed tokens' internal key-value dict.

With the basic grammar implemented in pyparsing, it would now be very
easy to make some of these internal expressions optional (using
Optional wrappers), or parseable in any order (using '&' operator
instead of '+' - '&' enforces presence of all values, but in any order).

-- Paul

from pyparsing import Suppress, Literal, Combine, oneOf, Word, alphanums, \
    restOfLine, ZeroOrMore, Group, ParseResults

LBRACE,RBRACE,EQ = map(Suppress,"{}=")
keylabel = lambda s : Literal(s) + EQ

grp_msg_type = Combine("xpl-" + oneOf("cmnd stat trig"))(GROUP_MESSAGE_TYPE)
grp_hop = keylabel("hop") + Word("123456789",exact=1)(GROUP_HOP)
grp_source = keylabel("source") + Combine(
    Word(alphanums,max=8)(GROUP_SRC_VENDOR_ID) + '-' +
    Word(alphanums,max=8)(GROUP_SRC_DEVICE_ID) + '.' +
    Word(alphanums,max=16)(GROUP_SRC_INSTANCE_ID)
    )(GROUP_SOURCE)
grp_target = keylabel("target") + Combine(
    '*'|Word(alphanums,max=8)(GROUP_TGT_VENDOR_ID) + '-' +
    Word(alphanums,max=8)(GROUP_TGT_DEVICE_ID) + '.' +
    Word(alphanums,max=16)(GROUP_TGT_INSTANCE_ID)
    )(GROUP_TARGET)
grp_schema = Combine(
    Word(alphanums,max=8)(GROUP_SCHEMA_CLASS) + '.' +
    Word(alphanums,max=8)(GROUP_SCHEMA_TYPE)
    )(GROUP_SCHEMA)

option_key = Word(alphanums+'-',max=16)
#~ option_val = Word(printables+' ',max=64)
option_val = restOfLine
options = (LBRACE +
    ZeroOrMore(Group(option_key("key") + EQ + option_val("value"))) +
    RBRACE)("options")

# this parse action will take the raw key=value groups and add them to
# the current results' named tokens
def make_options_dict(tokens):
    for k,v in tokens.asList():
        if k not in tokens:
            tokens[k] = ParseResults([])
        tokens[k] += ParseResults(v)
    # delete redundant key-value created by pyparsing
    del tokens["options"]
    return tokens
options.setParseAction(make_options_dict)

msgFormat = (grp_msg_type +
    LBRACE + grp_hop + grp_source + grp_target + RBRACE +
    grp_schema + options)

# parse each message
for msgstr in msgdata:
    msg = msgFormat.parseString(msgstr)
    #~ print msg.dump()
    print "Message type:", msg.message_type
    print "Hop:", msg.hop
    print "Options:"
    print msg.options.dump()
    print

Prints:

Message type: xpl-stat
Hop: 1
Options:
[['interval', '10']]
- interval: ['10']

Message type: xpl-stat
Hop: 1
Options:
[['reconf', 'newconf'], ['option', 'interval '], ['option', 'group[16]'], ['option', 'filter[16]']]
- option: ['interval ', 'group[16]', 'filter[16]']
- reconf: ['newconf']
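The dict limitation mentioned in this message is easy to reproduce in plain Python, without pyparsing. What follows is only a minimal sketch in Python 3, assuming the option lines have already been split into (key, value) tuples. The names `pairs`, `last_only`, `group_pairs` and `multi` are made up for this illustration and are not part of pyparsing or of the code above.

```python
from collections import defaultdict

# Option pairs as they appear in the second example message
# (hypothetical variable name, values typed in by hand).
pairs = [("reconf", "newconf"), ("option", "interval"),
         ("option", "group[16]"), ("option", "filter[16]")]

# A plain dict keeps only the last value seen for a repeated key.
last_only = dict(pairs)

def group_pairs(pairs):
    """Collect every value for each key, preserving arrival order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

multi = group_pairs(pairs)

print(last_only["option"])  # filter[16]
print(multi["option"])      # ['interval', 'group[16]', 'filter[16]']
```

This is roughly the same stitching that the make_options_dict parse action performs, only on ordinary tuples instead of parsed tokens.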
RE: regular expression extracting groups
If it's *NOT* an exercise in re, and if the input is a bunch of lines
within '{' and '}' where each line is a key=value pair, I would not go
near re. Instead, simply parse the keys into a dictionary that maps each
key to a list of its values, and process them from the dictionary as
below. The key 'option' correctly ends up with 2 entries, 'value' and
'7', in the right order. This works for any input in that format, as
long as keys and values contain no whitespace (s.split() splits on any
whitespace).

# assuming variable s has the string..
s = """{
option=value
foo=bar
another=42
option=7
}"""

>>> data = {}
>>> for line in s.split():
...     ix = line.find('=')
...     if ix >= 0:
...         key = line[:ix]
...         val = line[ix + 1:]
...         try:
...             data[key].append(val)
...         except KeyError:
...             data.setdefault(key, [val])
...
>>> for k, v in data.items():
...     print 'key=%s val=%s' % (k, v)
...
key=foo val=['bar']
key=option val=['value', '7']
key=another val=['42']

With another dictionary that maps each key to a function for processing
that key's values, it's then just a matter of iterating over the keys.

hope that simplifies and helps..

thx
Edwin

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
Sent: Sunday, August 10, 2008 8:30 AM
To: python-list@python.org
Subject: regular expression extracting groups

Hi list,

I'm trying to use regular expressions to help me quickly extract the
contents of messages that my application will receive. I have worked
out most of the regex but the last section of the message has me
stumped. This is mostly because I want to pull the content out into
regex groups that I can easily access later. I have a regex to extract
the key/value pairs but it ends up with only the contents of the last
key/value pair encountered.

An example of the section of the message that is troubling me appears
like this:

{
option=value
foo=bar
another=42
option=7
}

So it's basically a bunch of lines. Every line is terminated with a
'\n' character. The number of key/value fields changes depending on
the particular message. Also notice that there are two 'option' keys.
This is allowable and I need to cater for it.

A couple of example messages are:

xpl-stat\n{\nhop=1\nsource=vendor-device.instance\ntarget=*\n}
\nhbeat.basic\n{\ninterval=10\n}\n

xpl-stat\n{\nhop=1\nsource=vendor-device.instance\ntarget=vendor-
device.instance\n}\nconfig.list\n{\nreconf=newconf\noption=interval
\noption=group[16]\noption=filter[16]\n}\n

As all messages follow the same pattern I'm hoping to develop a
generic regex, instead of one for each message kind - because there
are many, that can pull a message from a received packet.

The regex I came up with looks like this:

# This should match any xPL message

GROUP_MESSAGE_TYPE = 'message_type'
GROUP_HOP = 'hop'
GROUP_SOURCE = 'source'
GROUP_TARGET = 'target'
GROUP_SRC_VENDOR_ID = 'source_vendor_id'
GROUP_SRC_DEVICE_ID = 'source_device_id'
GROUP_SRC_INSTANCE_ID = 'source_instance_id'
GROUP_TGT_VENDOR_ID = 'target_vendor_id'
GROUP_TGT_DEVICE_ID = 'target_device_id'
GROUP_TGT_INSTANCE_ID = 'target_instance_id'
GROUP_IDENTIFIER_TYPE = 'identifier_type'
GROUP_SCHEMA = 'schema'
GROUP_SCHEMA_CLASS = 'schema_class'
GROUP_SCHEMA_TYPE = 'schema_type'
GROUP_OPTION_KEY = 'key'
GROUP_OPTION_VALUE = 'value'

XplMessageGroupsRe = r'''(?P<%s>xpl-(cmnd|stat|trig))\n    # message type
\{\n    #
hop=(?P<%s>[1-9]{1})\n    # hop count
source=(?P<%s>(?P<%s>[a-z0-9]{1,8})-(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,16}))\n    # source identifier
target=(?P<%s>(\*|(?P<%s>[a-z0-9]{1,8})-(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,16})))\n    # target identifier
\}\n    #
(?P<%s>(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,8}))\n    # schema
\{\n    #
(?:(?P<%s>[a-z0-9\-]{1,16})=(?P<%s>[\x20-\x7E]{0,128})\n){1,64}    # key/value pairs
\}\n''' % (GROUP_MESSAGE_TYPE,
           GROUP_HOP,
           GROUP_SOURCE,
           GROUP_SRC_VENDOR_ID,
           GROUP_SRC_DEVICE_ID,
           GROUP_SRC_INSTANCE_ID,
           GROUP_TARGET,
           GROUP_TGT_VENDOR_ID,
           GROUP_TGT_DEVICE_ID,
           GROUP_TGT_INSTANCE_ID,
           GROUP_SCHEMA,
           GROUP_SCHEMA_CLASS,
           GROUP_SCHEMA_TYPE,
           GROUP_OPTION_KEY,
           GROUP_OPTION_VALUE)

XplMessageGroups = re.compile(XplMessageGroupsRe, re.VERBOSE | re.DOTALL)

If I pass the second example message through this regex the 'key'
group ends up containing 'option' and the 'value' group ends up
containing 'filter[16]' which are the last key/value pairs in that
message.

So the problem I have lies in the key/value regex extraction section.
It handles multiple occurrences of the pattern and writes the content
into the single key/value group hence I can't extract and access all
fields.

Is there some other way to do this which allows me to store all the
key/value pairs into the regex mat
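
For reference, the behaviour described in the question is how Python's re module works: a group inside a repeated subpattern only remembers its last repetition. One possible way around it, sketched here in Python 3 under the assumption that the options block is the final {...} block of the message, is to capture the whole block with one regex and then run a second, simpler regex over its body with finditer. The names OPTIONS_BLOCK, PAIR and extract_options are invented for this example; the character classes are copied from the original regex, and the sample message is the second example with the mail line-wrapping removed.

```python
import re

# Pass 1: the final {...} block. The inner key/value groups sit inside a
# repeated subpattern, so they only retain the LAST repetition.
OPTIONS_BLOCK = re.compile(
    r'\{\n'
    r'(?P<body>(?:(?P<key>[a-z0-9\-]{1,16})=(?P<value>[\x20-\x7E]{0,128})\n){1,64})'
    r'\}\n\Z')

# Pass 2: a single key=value line, applied repeatedly to the block body.
PAIR = re.compile(r'(?P<key>[a-z0-9\-]{1,16})=(?P<value>[\x20-\x7E]{0,128})\n')

def extract_options(message):
    """Return every (key, value) pair of the final block, in order."""
    block = OPTIONS_BLOCK.search(message)
    if block is None:
        return []
    return [(m.group('key'), m.group('value'))
            for m in PAIR.finditer(block.group('body'))]

msg = ('xpl-stat\n{\nhop=1\nsource=vendor-device.instance\n'
       'target=vendor-device.instance\n}\nconfig.list\n{\nreconf=newconf\n'
       'option=interval\noption=group[16]\noption=filter[16]\n}\n')

block = OPTIONS_BLOCK.search(msg)
last_key, last_value = block.group('key'), block.group('value')
print(last_key, last_value)   # option filter[16]

options = extract_options(msg)
print(options)
# [('reconf', 'newconf'), ('option', 'interval'),
#  ('option', 'group[16]'), ('option', 'filter[16]')]
```

The resulting list of tuples keeps duplicates and order, so it can be fed into a dictionary of lists if lookup by key is needed.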
Re: regular expression extracting groups
On Aug 10, 2:30 pm, [EMAIL PROTECTED] wrote:
> I'm trying to use regular expressions to help me quickly extract the
> contents of messages that my application will receive.

Don't use regexps for parsing complex data; they're limited,
completely unreadable, and hugely difficult to debug. Your code is
well written, and you've already reached the limits of the power of
regexps, and it's difficult to read.

Have a look at pyparsing for a simple solution to your problem.
http://pyparsing.wikispaces.com/

--
Paul Hankin