Re: pyparsing question: single word values with a double quoted string every once in a while
hubritic <colinland...@gmail.com> (h) wrote:

h> I want to parse a log that has entries like this:

h> [2009-03-17 07:28:05.545476 -0500] rprt s=d2bpr80d6 m=2 mod=mail
h> cmd=msg module=access rule=x_dynamic_ip action=discard attachments=0
h> rcpts=1
h> routes=DL_UK_ALL,NOT_DL_UK_ALL,default_inbound,firewallsafe,mail01_mail02,spfsafe
h> size=4363 guid=291f0f108fd3a6e73a11f96f4fb9e4cd hdr_mid=
h> qid=n2HCS4ks025832 subject="I want to interview you" duration=0.236
h> elapsed=0.280

h> The keywords will not always be the same. Also, differing log levels
h> will provide a different mix of keywords.

h> This is good enough to get the majority of cases where there is a
h> keyword, a = and then a value with no spaces:

h>     Group(Word(alphas + "+_-.").setResultsName("keyword") + Suppress
h>     (Literal("=")) + Optional(Word(printables)))

h> Sometimes there is a subject, which is a quoted string. That is easy
h> enough to get with this:

h>     dblQuotedString(ZeroOrMore(Word(printables)))

h> My problem is combining them into one expression. Either I wind up
h> with just the subject or I wind up with the keywords and their
h> values, one of which is:

h>     subject, 'I'

h> which is clearly not what I want.

h> Do I scan each line twice, first looking for quotes?

Use the MatchFirst (|). I have also split it up to make it more readable:

    kw = Word(alphas + "+_-.").setResultsName("keyword")
    eq = Suppress(Literal("="))
    value = dblQuotedString | Optional(Word(printables))
    pattern = Group(kw + eq + value)

--
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
--
http://mail.python.org/mailman/listinfo/python-list
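For readers who want to try the MatchFirst suggestion, here is a minimal sketch, assuming pyparsing is installed. The shortened sample line, the results name "value", and the removeQuotes parse action are illustrative additions, not part of the posts above. One caveat it does not address: an empty value such as hdr_mid= followed by a space would, as far as I can tell, swallow the next key=value token as its value, because Word(printables) skips leading whitespace.

```python
from pyparsing import (Group, Optional, Suppress, Word, ZeroOrMore,
                       alphas, dblQuotedString, printables, removeQuotes)

# Keyword characters follow the original post: letters plus "+_-."
kw = Word(alphas + "+_-.")("keyword")
eq = Suppress("=")

# Try the quoted form first, then fall back to one run of printable characters.
# The copy() keeps the parse action off the shared dblQuotedString object.
quoted = dblQuotedString.copy().setParseAction(removeQuotes)
value = quoted | Word(printables)

# "value" is a hypothetical results name chosen for this sketch.
pair = Group(kw + eq + Optional(value, default="")("value"))
pairs = ZeroOrMore(pair)

# Shortened, assumed sample in the style of the log entry in the thread.
sample = 'mod=mail size=4363 subject="I want to interview you" duration=0.236'
parsed = [(p.keyword, p.value) for p in pairs.parseString(sample)]
print(parsed)
```

Putting the quoted alternative first matters: MatchFirst takes the first alternative that matches, so the plain-word fallback never gets the chance to stop at the first space inside the subject.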
pyparsing question: single word values with a double quoted string every once in a while
I want to parse a log that has entries like this:

[2009-03-17 07:28:05.545476 -0500] rprt s=d2bpr80d6 m=2 mod=mail
cmd=msg module=access rule=x_dynamic_ip action=discard attachments=0
rcpts=1
routes=DL_UK_ALL,NOT_DL_UK_ALL,default_inbound,firewallsafe,mail01_mail02,spfsafe
size=4363 guid=291f0f108fd3a6e73a11f96f4fb9e4cd hdr_mid=
qid=n2HCS4ks025832 subject="I want to interview you" duration=0.236
elapsed=0.280

The keywords will not always be the same. Also, differing log levels will provide a different mix of keywords.

This is good enough to get the majority of cases where there is a keyword, a = and then a value with no spaces:

    Group(Word(alphas + "+_-.").setResultsName("keyword") + Suppress
    (Literal("=")) + Optional(Word(printables)))

Sometimes there is a subject, which is a quoted string. That is easy enough to get with this:

    dblQuotedString(ZeroOrMore(Word(printables)))

My problem is combining them into one expression. Either I wind up with just the subject or I wind up with the keywords and their values, one of which is:

    subject, 'I'

which is clearly not what I want.

Do I scan each line twice, first looking for quotes?

Thanks
Pyparsing Question
Hi all,

I have a question on PyParsing. I am trying to create a parser for a hierarchical todo list format, but have hit a stumbling block. I have parsers for the header of the list (title and description), and the body (recursive descent on todo items).

Individually they are working fine, combined they throw an exception. The code follows:

    #!/usr/bin/python
    # parser.py
    import pyparsing as pp

    def grammar():
        underline = pp.Word("=").suppress()
        dotnum = pp.Combine(pp.Word(pp.nums) + ".")
        textline = pp.Combine(pp.Group(pp.Word(pp.alphas, pp.printables) +
                                       pp.restOfLine))
        number = pp.Group(pp.OneOrMore(dotnum))

        headtitle = textline
        headdescription = pp.ZeroOrMore(textline)
        head = pp.Group(headtitle + underline + headdescription)

        taskname = pp.OneOrMore(dotnum) + textline
        task = pp.Forward()
        subtask = pp.Group(dotnum + task)
        task << (taskname + pp.ZeroOrMore(subtask))

        maintask = pp.Group(pp.LineStart() + task)
        parser = pp.OneOrMore(maintask)

        return head, parser

    text = """My Title
    ========
    Text on a longer line of several words.
    More test and more.
    """

    text2 = """
    1. Task 1
    1.1. Subtask
    1.1.1. More tasks.
    1.2. Another subtask
    2. Task 2
    2.1. Subtask again"""

    head, parser = grammar()

    print head.parseString(text)
    print parser.parseString(text2)

    comb = head + pp.OneOrMore(pp.LineStart() + pp.restOfLine) + parser
    print comb.parseString(text + text2)

    #===

Now the first two print statements output the parse tree as I would expect, but the combined parser fails with an exception:

    Traceback (most recent call last):
      File "parser.py", line 50, in ?
        print comb.parseString(text + text2)
    .
    . [Stacktrace snipped]
    .
        raise exc
    pyparsing.ParseException: Expected start of line (at char 81), (line:9, col:1)

Any help appreciated!

Cheers,

--
Ant.
Re: Pyparsing Question
On May 16, 6:43 am, Ant [EMAIL PROTECTED] wrote:
> Hi all,
>
> I have a question on PyParsing. I am trying to create a parser for a
> hierarchical todo list format, but have hit a stumbling block. I have
> parsers for the header of the list (title and description), and the body
> (recursive descent on todo items).

LineStart *really* wants to be parsed at the beginning of a line. Your textline reads up to but not including the LineEnd. Try making these changes.

1. Change textline to:

    textline = pp.Combine(
        pp.Group(pp.Word(pp.alphas, pp.printables) + pp.restOfLine)) + \
        pp.LineEnd().suppress()

2. Change comb to:

    comb = head + parser

With these changes, my version of your code runs ok.

-- Paul
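The point about restOfLine stopping short of the line ending can be checked in isolation. The following is a small sketch, assuming pyparsing is installed; the two-line sample string and the variable names are made up for illustration.

```python
import pyparsing as pp

two_lines = "first line\nsecond line"

# restOfLine stops just before the newline, so a second restOfLine
# starts at the newline itself and has nothing left to match on that line.
without_line_end = pp.restOfLine + pp.restOfLine

# Consuming the line ending explicitly moves the parser onto the next line.
with_line_end = pp.restOfLine + pp.LineEnd().suppress() + pp.restOfLine

stuck = without_line_end.parseString(two_lines).asList()
advanced = with_line_end.parseString(two_lines).asList()

print(stuck)
print(advanced)
```

This is the same reason the suggested textline appends pp.LineEnd().suppress(): once each text line consumes its own newline, the next expression begins at a true start of line, which is where LineStart expects to be.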
Re: Pyparsing Question
On May 16, 6:43 am, Ant [EMAIL PROTECTED] wrote:
> [full quote of the original post snipped]

I hold that the + operator should be overloaded for strings to include newlines. Python 3.0 print has parentheses around it; wouldn't it make sense to take them out?
Re: Pyparsing Question
Hi Paul,

> LineStart *really* wants to be parsed at the beginning of a line.
> Your textline reads up to but not including the LineEnd. Try making
> these changes.
>
> 1. Change textline to:
>
>     textline = pp.Combine(
>         pp.Group(pp.Word(pp.alphas, pp.printables) + pp.restOfLine)) + \
>         pp.LineEnd().suppress()

Ah - so restOfLine excludes the actual line ending, does it?

> 2. Change comb to:
>
>     comb = head + parser

Yes - I'd got this originally. I added the garbage to try to fix the problem and forgot to take it back out! Thanks for the advice - it works fine now, and will provide a base for extending the list format.

Thanks,

Ant...
Re: Pyparsing Question
On May 16, 10:45 am, Ant [EMAIL PROTECTED] wrote:
> [full quote of the previous message snipped]

There is a possibility that spirals can come from doubles, which could be non-trivially useful, in par. in the Java library. I won't see a cent. Can anyone start a thread to spin letters, and see what the team looks like? I want to animate spinners. It's across dimensions. (per something.) Swipe a cross in a fluid. I'm draw crosses. Animate cubes to draw crosses. I.e. swipe them.
Re: pyparsing question
On Jan 1, 4:18 pm, John Machin [EMAIL PROTECTED] wrote:
> On Jan 2, 10:32 am, hubritic [EMAIL PROTECTED] wrote:
> > The data I have has a fixed number of characters per field, so I could
> > split it up that way, but wouldn't that defeat the purpose of using a
> > parser?
>
> The purpose of a parser is to parse. Data in fixed columns does not
> need parsing.
>
> > I am determined to become proficient with pyparsing so I am
> > using it even when it could be considered overkill; thus, it has gone
> > past mere utility now, this is a matter of principle!
>
> An extremely misguided principle. Would you use an AK47 on the
> flies around your barbecue? A better principle is to choose the best
> tool for the job.

Your principle is no doubt the saner one for the real world, but your example of AK47 is a bit off. We generally know enough about an AK47 to know that it is not something to kill flies with. Consider, though, if someone unfamiliar with the concept of guns and mayhem got an AK47 for xmas and was only told that it was really good for killing things. He would try it out and would discover that indeed it kills all sorts of things. So he might try killing flies. Then he would discover the limitations; those already familiar with guns would wonder why he would waste his time.
Re: pyparsing question
On Jan 1, 5:32 pm, hubritic [EMAIL PROTECTED] wrote:
> I am trying to parse data that looks like this:
>
> IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
> 2BFA76F6 1208230607 T S SYSPROC SYSTEM SHUTDOWN BY USER
> A6D1BD62 1215230807 I H Firmware Event
>
> <snip>
>
> The data I have has a fixed number of characters per field, so I could
> split it up that way, but wouldn't that defeat the purpose of using a
> parser?

I think you have this backwards. I use pyparsing for a lot of text processing, but if it is not a good fit, or if str.split is all that is required, there is no real rationale for using anything more complicated.

> I am determined to become proficient with pyparsing so I am
> using it even when it could be considered overkill; thus, it has gone
> past mere utility now, this is a matter of principle!

Well, I'm glad you are driven to learn pyparsing if it kills you, but John Machin has a good point. This data is really so amenable to something as simple as:

    for line in logfile:
        id,timestamp,t,c,resource_and_description = line.split(None,4)

that it is difficult to recommend pyparsing for this case. The sample you posted was space-delimited, but if it is tab-delimited, and there is a pair of tabs between the "H" and "Firmware Event" on the second line, then just use split("\t") for your data and be done.

Still, pyparsing may be helpful in disambiguating that RESOURCE_NAME and DESCRIPTION text. One approach would be to enumerate (if possible) the different values of RESOURCE_NAME. Something like this:

    ident = Word(alphanums)
    timestamp = Word(nums,exact=10)

    # I don't know what these are, I'm just getting the values
    # from the sample text you posted
    t_field = oneOf("T I")
    c_field = oneOf("S H")

    # I'm just guessing here, you'll need to provide the actual
    # values from your log file
    resource_name = oneOf("SYSPROC USERPROC IOSUBSYS whatever")

    logline = ident("identifier") + timestamp("time") + \
        t_field("T") + c_field("C") + \
        Optional(resource_name, default="")("resource") + \
        Optional(restOfLine, default="")("description")

Another tack to take might be to use a parse action on the resource name, to verify the column position of the found token by using the pyparsing method col:

    def matchOnlyAtCol(n):
        def verifyCol(strg,locn,toks):
            if col(locn,strg) != n:
                raise ParseException(strg,locn,"matched token not at column %d" % n)
        return verifyCol

    resource_name = Word(alphas).setParseAction(matchOnlyAtCol(35))

This will only work if your data really is columnar - the example text that you posted isn't. (Hmm, I like that matchOnlyAtCol method, I think I'll add that to the next release of pyparsing...)

Here are some similar parsers that might give you some other ideas:
http://pyparsing.wikispaces.com/space/showimage/httpServerLogParser.py
http://mail.python.org/pipermail/python-list/2005-January/thread.html#301450

In the second link, I made a similar remark, that pyparsing may not be the first tool to try, but the variability of the input file made the non-pyparsing options pretty hairy-looking with special case code, so in the end, pyparsing was no more complex to use.

Good luck!

-- Paul
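As a rough illustration of how far the plain str.split route goes, here is a sketch that combines it with the "enumerate the resource names" idea. It uses only the standard library. KNOWN_RESOURCES is a hypothetical set (the real values would have to come from the log), parse_line is a made-up helper name, and the sketch assumes every line has at least five whitespace-separated fields.

```python
# Hypothetical set of resource names; the real values must come from the log.
KNOWN_RESOURCES = {"SYSPROC", "USERPROC", "IOSUBSYS"}


def parse_line(line):
    """Split one log line into fields, treating RESOURCE_NAME as optional."""
    # Split off the four leading fields; everything else stays in one string.
    ident, timestamp, t, c, rest = line.split(None, 4)
    rest = rest.strip()

    # Only treat the first word as a resource name if it is a known one.
    words = rest.split(None, 1)
    if words and words[0] in KNOWN_RESOURCES:
        resource = words[0]
        description = words[1] if len(words) > 1 else ""
    else:
        resource = ""
        description = rest

    return {
        "identifier": ident,
        "timestamp": timestamp,
        "t": t,
        "c": c,
        "resource": resource,
        "description": description,
    }


with_resource = parse_line("2BFA76F6 1208230607 T S SYSPROC SYSTEM SHUTDOWN BY USER")
without_resource = parse_line("A6D1BD62 1215230807 I H Firmware Event")
print(with_resource)
print(without_resource)
```

The weakness is the same one noted for the oneOf approach: a description that happens to begin with a known resource name would be misread, so this only holds up if the two vocabularies do not overlap.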
pyparsing question
I am trying to parse data that looks like this:

IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
2BFA76F6 1208230607 T S SYSPROC SYSTEM SHUTDOWN BY USER
A6D1BD62 1215230807 I H Firmware Event

My problem is that sometimes there is a RESOURCE_NAME and sometimes not, so I wind up with "Firmware" as my RESOURCE_NAME and "Event" as my DESCRIPTION. The formatting seems to use a set number of spaces.

I have tried making RESOURCE_NAME an Optional(Word(alphanums)) and DESCRIPTION OneOrMore(Word(alphas)) + LineEnd(). So the question is, how can I avoid having the first word of DESCRIPTION sucked into RESOURCE_NAME when that field should be blank?

The data I have has a fixed number of characters per field, so I could split it up that way, but wouldn't that defeat the purpose of using a parser? I am determined to become proficient with pyparsing so I am using it even when it could be considered overkill; thus, it has gone past mere utility now, this is a matter of principle!

thanks
Re: pyparsing question
On Jan 1, 2008 6:32 PM, hubritic [EMAIL PROTECTED] wrote:
> I am trying to parse data that looks like this:
>
> IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
> 2BFA76F6 1208230607 T S SYSPROC SYSTEM SHUTDOWN BY USER
> A6D1BD62 1215230807 I H Firmware Event
>
> My problem is that sometimes there is a RESOURCE_NAME and sometimes
> not, so I wind up with "Firmware" as my RESOURCE_NAME and "Event" as
> my DESCRIPTION. The formatting seems to use a set number of spaces.
>
> The data I have has a fixed number of characters per field, so I could
> split it up that way, but wouldn't that defeat the purpose of using a
> parser? I am determined to become proficient with pyparsing so I am
> using it even when it could be considered overkill; thus, it has gone
> past mere utility now, this is a matter of principle!

If your data is really in fixed-size columns, then pyparsing is the wrong tool.

There's no standard Python tool for reading and writing fixed-length field "flatfile" data files, but it's pretty simple to use named slices to get at the data.

    identifier = slice(0, 8)
    timestamp = slice(8, 18)
    t = slice(18, 21)
    c = slice(21, 24)
    resource_name = slice(24, 35)
    description = slice(35)

    for line in file:
        line = line.rstrip("\n")
        print "id:", line[identifier]
        print "timestamp:", line[timestamp]
        ...etc...

--
Neil Cerutti
Re: pyparsing question
On Jan 1, 2008 6:54 PM, Neil Cerutti [EMAIL PROTECTED] wrote:
> There's no standard Python tool for reading and writing fixed-length field
> "flatfile" data files, but it's pretty simple to use named slices to get at
> the data.
>
>     identifier = slice(0, 8)
>     timestamp = slice(8, 18)
>     t = slice(18, 21)
>     c = slice(21, 24)
>     resource_name = slice(24, 35)
>     description = slice(35)

Oops! I made an errant stab at the slice constructor. That last should be 'slice(35, None)'.

--
Neil Cerutti
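A runnable version of the named-slice idea, with the correction applied, might look like the sketch below. It uses only the standard library. The column offsets are the ones guessed at in the message above and are not confirmed against the real log; the FIELDS table, the split_fixed helper, and the sample line (padded to those assumed widths) are made up for illustration.

```python
# Column layout taken from the message above; the offsets are assumptions
# about the log format, not verified against real data.
FIELDS = [
    ("identifier", slice(0, 8)),
    ("timestamp", slice(8, 18)),
    ("t", slice(18, 21)),
    ("c", slice(21, 24)),
    ("resource_name", slice(24, 35)),
    # slice(35) would mean "the first 35 characters"; the open-ended
    # form slice(35, None) means "from column 35 to the end".
    ("description", slice(35, None)),
]


def split_fixed(line):
    """Cut one fixed-width line into named, whitespace-stripped fields."""
    line = line.rstrip("\n")
    return {name: line[columns].strip() for name, columns in FIELDS}


# Sample line padded to the assumed widths (8, 10, 3, 3, 11, rest),
# with an empty RESOURCE_NAME column.
sample = "%-8s%-10s%-3s%-3s%-11s%s\n" % (
    "A6D1BD62", "1215230807", " I", " H", "", "Firmware Event")

record = split_fixed(sample)
print(record)
```

Because the blank RESOURCE_NAME column is simply eleven spaces, it comes back as an empty string and the description is never mistaken for it, which is the problem the original poster was fighting.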
Re: pyparsing question
On Jan 2, 10:32 am, hubritic [EMAIL PROTECTED] wrote:
> The data I have has a fixed number of characters per field, so I could
> split it up that way, but wouldn't that defeat the purpose of using a
> parser?

The purpose of a parser is to parse. Data in fixed columns does not need parsing.

> I am determined to become proficient with pyparsing so I am
> using it even when it could be considered overkill; thus, it has gone
> past mere utility now, this is a matter of principle!

An extremely misguided principle. Would you use an AK47 on the flies around your barbecue? A better principle is to choose the best tool for the job.
Re: Pyparsing Question.
> Welcome to pyparsing! The simplest way to implement a markup processor in
> pyparsing is to define the grammar of the markup, attach a parse action to
> each markup type to convert the original markup to the actual results, and
> then use transformString to run through the input and do the conversion.
> This discussion topic has some examples:
> http://pyparsing.wikispaces.com/message/view/home/31853.

Thanks for the pointers - I had a look through the examples on the pyparsing website, but none seemed to show a simple example of this kind of thing. The discussion topic you noted above is exactly the sort of thing I was after!

Cheers,
Pyparsing Question.
I have a home-grown Wiki that I created as an exercise, with its own wiki markup (actually just a clone of the Trac wiki markup). The wiki text parser I wrote works nicely, but makes heavy use of regexes, tags and stacks to parse the text. As such it is a bit of a maintainability nightmare - adding new wiki constructs can be a bit painful.

So I thought I'd look into the pyparsing module, but can't find a simple example of processing random text. For example, I want to parse the following:

    Some random text and '''some bold text''' and some more random text

into:

    Some random text and <strong>some bold text</strong> and some more random text

I have the following as a starting point:

    from pyparsing import *

    def parse(text):
        italics = QuotedString(quoteChar="''")

        parser = Optional(italics)

        parsed_text = parser.parseString(text)

    print parse("Test this is '''bold''' but this is not.")

So if you could provide a bit of a starting point, I'd be grateful!

Cheers,
Re: Pyparsing Question.
Ant wrote:
> So I thought I'd look into the pyparsing module, but can't find a simple
> example of processing random text.

Have you looked at the examples on the pyparsing web page?

Stefan
Re: Pyparsing Question.
Ant [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]
> I have a home-grown Wiki that I created as an exercise, with its own
> wiki markup (actually just a clone of the Trac wiki markup). The wiki
> text parser I wrote works nicely, but makes heavy use of regexes, tags
> and stacks to parse the text. As such it is a bit of a maintainability
> nightmare - adding new wiki constructs can be a bit painful.
>
> So I thought I'd look into the pyparsing module, but can't find a simple
> example of processing random text. For example, I want to parse the
> following:
>
>     Some random text and '''some bold text''' and some more random text
>
> into:
>
>     Some random text and <strong>some bold text</strong> and some more random text
>
> I have the following as a starting point:
>
>     from pyparsing import *
>
>     def parse(text):
>         italics = QuotedString(quoteChar="''")
>
>         parser = Optional(italics)
>
>         parsed_text = parser.parseString(text)
>
>     print parse("Test this is '''bold''' but this is not.")
>
> So if you could provide a bit of a starting point, I'd be grateful!
>
> Cheers,

Ant,

Welcome to pyparsing! The simplest way to implement a markup processor in pyparsing is to define the grammar of the markup, attach a parse action to each markup type to convert the original markup to the actual results, and then use transformString to run through the input and do the conversion. This discussion topic has some examples: http://pyparsing.wikispaces.com/message/view/home/31853.

-- Paul
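The three steps described above (grammar, parse action, transformString) can be sketched for the bold case alone. This is a minimal illustration, assuming pyparsing is installed; it is not the code from the linked discussion topic, and it handles only the '''bold''' construct from the question. The names bold and to_strong are made up for this sketch.

```python
from pyparsing import QuotedString

# Step 1: grammar for one markup construct, text wrapped in triple single quotes.
bold = QuotedString("'''")


# Step 2: parse action that turns the matched text into the output markup.
# QuotedString has already removed the quote characters from tokens[0].
def to_strong(tokens):
    return "<strong>%s</strong>" % tokens[0]


bold.setParseAction(to_strong)

# Step 3: transformString scans the whole input, replaces each match with the
# parse action's result, and copies everything else through unchanged.
text = "Some random text and '''some bold text''' and some more random text"
html = bold.transformString(text)
print(html)
```

Further constructs would presumably be added as more expressions, each with its own parse action, combined with | before calling transformString. Since Trac-style italics use two single quotes, the bold expression would need to be tried before the italic one.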