Re: Regex help needed!
# http://gist.github.com/271661
import lxml.html
import re

src = """
lksjdfls kdjff lsdfs sdjfls sdfsdwelcome
hello, my age is 86 years old and I was born in 1945. Do you know that
PI is roughly 3.1443534534534534534
"""

regex = re.compile('amazon_(\d+)')

doc = lxml.html.document_fromstring(src)

for div in doc.xpath('//div[starts-with(@id, "amazon_")]'):
    match = regex.match(div.get('id'))
    if match:
        print match.groups()[0]

On Thu, Jan 7, 2010 at 4:42 PM, Aahz wrote:
> In article
> <19de1d6e-5ba9-42b5-9221-ed7246e39...@u36g2000prn.googlegroups.com>,
> Oltmans wrote:
>>
>>I've written this regex that's kind of working
>>re.findall("\w+\s*\W+amazon_(\d+)",str)
>>
>>but I was just wondering that there might be a better RegEx to do that
>>same thing. Can you kindly suggest a better/improved Regex. Thank you
>>in advance.
>
> 'Some people, when confronted with a problem, think "I know, I'll use
> regular expressions." Now they have two problems.'
> --Jamie Zawinski
>
> Take the advice other people gave you and use BeautifulSoup.
> --
> Aahz (a...@pythoncraft.com) <*> http://www.pythoncraft.com/
>
> "If you think it's expensive to hire a professional to do the job, wait
> until you hire an amateur." --Red Adair
> --
> http://mail.python.org/mailman/listinfo/python-list
>

--
Rolando Espinoza La fuente
www.rolandoespinoza.info
--
http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
In article <19de1d6e-5ba9-42b5-9221-ed7246e39...@u36g2000prn.googlegroups.com>, Oltmans wrote: > >I've written this regex that's kind of working >re.findall("\w+\s*\W+amazon_(\d+)",str) > >but I was just wondering that there might be a better RegEx to do that >same thing. Can you kindly suggest a better/improved Regex. Thank you >in advance. 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.' --Jamie Zawinski Take the advice other people gave you and use BeautifulSoup. -- Aahz (a...@pythoncraft.com) <*> http://www.pythoncraft.com/ "If you think it's expensive to hire a professional to do the job, wait until you hire an amateur." --Red Adair -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
On 21.12.2009 12:38, Oltmans wrote:
> Hello,. everyone.
>
> I've a string that looks something like
>
> lksjdfls kdjff lsdfs sdjflssdfsdwelcome
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> but I was just wondering that there might be a better RegEx to do that
> same thing. Can you kindly suggest a better/improved Regex. Thank you
> in advance.

If you filter in two or even more sequential steps the problem becomes
a lot simpler, not least because you can test each step separately:

>>> r1 = re.compile (']*')  # Add ignore case and variable white space
>>> r2 = re.compile ('\d+')
>>> [r2.search (item).group () for item in r1.findall (s) if item]  # s is your sample
['345343', '35343433', '8898']  # Supposing all ids have digits

Frederic
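The two-step idea can be sketched in current Python 3 roughly as follows. This is only an illustration: the `sample` markup, the `amazon_ids` helper and both patterns are assumptions, since the archive did not preserve the original sample or the first pattern.

```python
import re

# Hypothetical sample: the archive stripped the real markup, so this is a
# guess at its general shape, not the poster's actual string.
sample = """
<div id='amazon_345343'>lksjdfls</div>
<DIV class="x" ID = "amazon_35343433">kdjff lsdfs</DIV>
<div id="amazon_8898">welcome</div>
hello, my age is 86 years old and I was born in 1945.
"""

# Step 1: isolate each opening <div ...> tag, ignoring case.
div_tag = re.compile(r'<div\b[^>]*>', re.IGNORECASE)

# Step 2: inside a single tag, find the digits after id = "amazon_
# (optional quote, variable white space around the equals sign).
amazon_id = re.compile(r'''\bid\s*=\s*["']?amazon_(\d+)''', re.IGNORECASE)


def amazon_ids(html):
    """Return the numeric part of every div id that starts with amazon_."""
    ids = []
    for tag in div_tag.findall(html):
        match = amazon_id.search(tag)
        if match:  # divs without a matching id are skipped
            ids.append(match.group(1))
    return ids


print(amazon_ids(sample))
```

Because each step is a separate pattern, `div_tag.findall(sample)` can be inspected on its own before the second pattern is applied.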
Re: Regex help needed!
On Dec 21, 5:38 am, Oltmans wrote:
> Hello,. everyone.
>
> I've a string that looks something like
>
> lksjdfls kdjff lsdfs sdjfls
> = "amazon_35343433">sdfsdwelcome
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>

The issue with using regexen for parsing HTML is that you often get
surprised by attributes that you never expected, or out of order, or
with weird or missing quotation marks, or tags or attributes that are
in upper/lower case. BeautifulSoup is one tool to use for HTML
scraping, here is a pyparsing example, with hopefully descriptive
comments:

from pyparsing import makeHTMLTags,ParseException

src = """
lksjdfls kdjff lsdfs sdjfls sdfsdwelcome
hello, my age is 86 years old and I was born in 1945. Do you know that
PI is roughly 3.1443534534534534534
"""

# use makeHTMLTags to return an expression that will match
# HTML <div> tags, including attributes, upper/lower case,
# etc. (makeHTMLTags will return expressions for both
# opening and closing tags, but we only care about the
# opening one, so just use the [0]th returned item)
div = makeHTMLTags("div")[0]

# define a parse action to filter only for <div> tags
# with the proper id form
def filterByIdStartingWithAmazon(tokens):
    if not tokens.id.startswith("amazon_"):
        raise ParseException(
            "must have id attribute starting with 'amazon_'")

# define a parse action that will add a pseudo-
# attribute 'amazon_id', to make it easier to get the
# numeric portion of the id after the leading 'amazon_'
def makeAmazonIdAttribute(tokens):
    tokens["amazon_id"] = tokens.id[len("amazon_"):]

# attach parse action callbacks to the div expression -
# these will be called during parse time
div.setParseAction(filterByIdStartingWithAmazon,
                   makeAmazonIdAttribute)

# search through the input string for matching <div>s,
# and print out their amazon_id's
for divtag in div.searchString(src):
    print divtag.amazon_id

Prints:

345343
35343433
8898
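For readers without pyparsing installed, the same filter-then-extract idea can be sketched with the standard library's `html.parser`, which also copes with upper/lower case tags and either kind of quote. The class name `AmazonIdCollector`, the helper `collect_amazon_ids` and the sample markup are all assumptions made for this illustration, not part of the post above.

```python
from html.parser import HTMLParser


class AmazonIdCollector(HTMLParser):
    """Collect the digits of every <div id="amazon_NNN"> seen while parsing."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag != "div":
            return
        for name, value in attrs:
            if name == "id" and value and value.startswith("amazon_"):
                suffix = value[len("amazon_"):]
                if suffix.isdigit():
                    self.ids.append(suffix)


def collect_amazon_ids(html):
    parser = AmazonIdCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


# Hypothetical sample; the real markup did not survive in the archive.
sample = """
<div id='amazon_345343'>lksjdfls</div>
<DIV class="x" ID = "amazon_35343433">kdjff lsdfs</DIV>
<div id="amazon_8898">welcome</div>
hello, my age is 86 years old and I was born in 1945.
"""

print(collect_amazon_ids(sample))
```

The numbers in the running text (86, 1945) are never considered, because only attribute values of `div` start tags are inspected.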
Re: Regex help needed!
how about re.findall(r'\w+.=\W\D+(\d+)?',str) ? this will work for any string within id ! ~Ukanth On Dec 21, 6:06 pm, Oltmans wrote: > On Dec 21, 5:05 pm, Umakanth wrote: > > > How about re.findall(r'\d+(?:\.\d+)?',str) > > > extracts only numbers from any string > > Thank you. However, I only need the digits within the ID attribute of > the DIV. Regex that you suggested fails on the following string > > > lksjdfls kdjff lsdfs sdjfls = "amazon_35343433">sdfsdwelcome > hello, my age is 86 years old and I was born in 1945. Do you know that > PI is roughly 3.1443534534534534534 > > > > ~uk > > > On Dec 21, 4:38 pm, Oltmans wrote: > > > > Hello,. everyone. > > > > I've a string that looks something like > > > > > > lksjdfls kdjff lsdfs sdjfls > > = "amazon_35343433">sdfsdwelcome > > > > > > > From above string I need the digits within the ID attribute. For > > > example, required output from above string is > > > - 35343433 > > > - 345343 > > > - 8898 > > > > I've written this regex that's kind of working > > > re.findall("\w+\s*\W+amazon_(\d+)",str) > > > > but I was just wondering that there might be a better RegEx to do that > > > same thing. Can you kindly suggest a better/improved Regex. Thank you > > > in advance. > > -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
> Oltmans wrote: > >I've a string that looks something like > > > >lksjdfls kdjff lsdfs sdjfls >= "amazon_35343433">sdfsdwelcome > > > > > >>From above string I need the digits within the ID attribute. For > >example, required output from above string is > >- 35343433 > >- 345343 > >- 8898 > > Your string is in /tmp/y in this example: $ grep -o [0-9]+ /tmp/y 345343 35343433 8898 Much simpler, isn't it? But that is not python. Regards Johann -- Johann Spies Telefoon: 021-808 4599 Informasietegnologie, Universiteit van Stellenbosch "And there were in the same country shepherds abiding in the field, keeping watch over their flock by night. And, lo, the angel of the Lord came upon them, and the glory of the Lord shone round about them: and they were sore afraid. And the angel said unto them, Fear not: for behold I bring you good tidings of great joy, which shall be to all people. For unto you is born this day in the city of David a Saviour, which is Christ the Lord."Luke 2:8-11 -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
Oltmans wrote:
> Hello,. everyone.
>
> I've a string that looks something like
>
> lksjdfls kdjff lsdfs sdjfls sdfsdwelcome
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> but I was just wondering that there might be a better RegEx to do that
> same thing. Can you kindly suggest a better/improved Regex. Thank you
> in advance.

Try:

re.findall(r"", str)

You shouldn't be using 'str' as a variable name because it hides the
builtin string class 'str'.
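Both points can be illustrated with a short Python 3 sketch. The pattern here is an assumption (the one suggested in the message was lost in the archive), and `find_amazon_ids` and `shadowing_demo` are hypothetical names:

```python
import re

# Assumed pattern: an opening div tag whose id attribute, quoted with either
# kind of quote, starts with amazon_ followed by digits.
AMAZON_DIV_ID = re.compile(
    r'''<div\b[^>]*?\bid\s*=\s*["']amazon_(\d+)["']''', re.IGNORECASE)


def find_amazon_ids(text):
    """The argument is called text, not str, so the builtin stays usable."""
    return AMAZON_DIV_ID.findall(text)


def shadowing_demo():
    """Show what goes wrong once a variable named str hides the builtin."""
    str = "now just a string"  # local name shadows the builtin type
    try:
        return str(42)  # no longer a call to the str type
    except TypeError as exc:
        return "TypeError: %s" % exc


text = ('<div id="amazon_345343">a</div> '
        "<DIV class='x' ID = 'amazon_8898'>b</DIV> my age is 86")
print(find_amazon_ids(text))
print(shadowing_demo())
```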
Re: Regex help needed!
Ok. how about re.findall(r'\w+_(\d+)',str) ? returns ['345343', '35343433', '8898', '8898'] ! On Dec 21, 6:06 pm, Oltmans wrote: > On Dec 21, 5:05 pm, Umakanth wrote: > > > How about re.findall(r'\d+(?:\.\d+)?',str) > > > extracts only numbers from any string > > Thank you. However, I only need the digits within the ID attribute of > the DIV. Regex that you suggested fails on the following string > > > lksjdfls kdjff lsdfs sdjfls = "amazon_35343433">sdfsdwelcome > hello, my age is 86 years old and I was born in 1945. Do you know that > PI is roughly 3.1443534534534534534 > > > > ~uk > > > On Dec 21, 4:38 pm, Oltmans wrote: > > > > Hello,. everyone. > > > > I've a string that looks something like > > > > > > lksjdfls kdjff lsdfs sdjfls > > = "amazon_35343433">sdfsdwelcome > > > > > > > From above string I need the digits within the ID attribute. For > > > example, required output from above string is > > > - 35343433 > > > - 345343 > > > - 8898 > > > > I've written this regex that's kind of working > > > re.findall("\w+\s*\W+amazon_(\d+)",str) > > > > but I was just wondering that there might be a better RegEx to do that > > > same thing. Can you kindly suggest a better/improved Regex. Thank you > > > in advance. > > -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
On Dec 21, 5:05 pm, Umakanth wrote: > How about re.findall(r'\d+(?:\.\d+)?',str) > > extracts only numbers from any string > Thank you. However, I only need the digits within the ID attribute of the DIV. Regex that you suggested fails on the following string lksjdfls kdjff lsdfs sdjfls sdfsdwelcome hello, my age is 86 years old and I was born in 1945. Do you know that PI is roughly 3.1443534534534534534 > ~uk > > On Dec 21, 4:38 pm, Oltmans wrote: > > > Hello,. everyone. > > > I've a string that looks something like > > > > lksjdfls kdjff lsdfs sdjfls > = "amazon_35343433">sdfsdwelcome > > > > > From above string I need the digits within the ID attribute. For > > example, required output from above string is > > - 35343433 > > - 345343 > > - 8898 > > > I've written this regex that's kind of working > > re.findall("\w+\s*\W+amazon_(\d+)",str) > > > but I was just wondering that there might be a better RegEx to do that > > same thing. Can you kindly suggest a better/improved Regex. Thank you > > in advance. > > -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed!
Oltmans wrote:

> I've a string that looks something like
>
> lksjdfls kdjff lsdfs sdjfls
> = "amazon_35343433">sdfsdwelcome
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> but I was just wondering that there might be a better RegEx to do that
> same thing. Can you kindly suggest a better/improved Regex. Thank you
> in advance.

>>> from BeautifulSoup import BeautifulSoup
>>> bs = BeautifulSoup("""lksjdfls kdjff lsdfs sdjfls sdfsdwelcome""")
>>> [node["id"][7:] for node in bs(id=lambda id: id.startswith("amazon_"))]
[u'345343', u'35343433', u'8898']

I think BeautifulSoup is a better tool for the task since it actually
"understands" HTML.

Peter
Re: Regex help needed!
On Dec 21, 7:38 pm, Oltmans wrote:
> Hello,. everyone.
>
> I've a string that looks something like
>
> lksjdfls kdjff lsdfs sdjfls
> = "amazon_35343433">sdfsdwelcome
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> but I was just wondering that there might be a better RegEx to do that
> same thing. Can you kindly suggest a better/improved Regex. Thank you
> in advance.

You don't need a regular expression. Just do a split on "amazon_":

>>> s="""lksjdfls kdjff lsdfs sdjfls
>> = "amazon_35343433">sdfsdwelcome"""
>>> for item in s.split("amazon_")[1:]:
...     print item
...
345343'> kdjff lsdfs sdjfls sdfsdwelcome

Then find the ' or " indices in each piece and do index slicing.
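The split-then-slice suggestion might look like this when spelled out in Python 3. The helper name `ids_by_split` and the sample markup are assumptions for illustration only:

```python
def ids_by_split(text, marker="amazon_"):
    """Split on the marker, then slice each piece up to its first quote."""
    ids = []
    for chunk in text.split(marker)[1:]:  # piece 0 precedes the first marker
        # Index of the first closing quote of either kind, if there is one.
        ends = [i for i in (chunk.find("'"), chunk.find('"')) if i != -1]
        if not ends:
            continue
        candidate = chunk[:min(ends)]
        if candidate.isdigit():  # keep purely numeric ids only
            ids.append(candidate)
    return ids


# Hypothetical sample; the real markup did not survive in the archive.
sample = """
<div id='amazon_345343'>lksjdfls</div>
<DIV class="x" ID = "amazon_35343433">kdjff lsdfs</DIV>
<div id="amazon_8898">welcome</div>
hello, my age is 86 years old and I was born in 1945.
"""

print(ids_by_split(sample))
```

Unlike the HTML-aware answers in this thread, this picks up `amazon_` anywhere in the text, not only inside a div's id attribute, so it suits input whose shape is already known.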
Re: Regex help needed!
How about re.findall(r'\d+(?:\.\d+)?',str) extracts only numbers from any string ~uk On Dec 21, 4:38 pm, Oltmans wrote: > Hello,. everyone. > > I've a string that looks something like > > lksjdfls kdjff lsdfs sdjfls = "amazon_35343433">sdfsdwelcome > > > From above string I need the digits within the ID attribute. For > example, required output from above string is > - 35343433 > - 345343 > - 8898 > > I've written this regex that's kind of working > re.findall("\w+\s*\W+amazon_(\d+)",str) > > but I was just wondering that there might be a better RegEx to do that > same thing. Can you kindly suggest a better/improved Regex. Thank you > in advance. -- http://mail.python.org/mailman/listinfo/python-list
Regex help needed!
Hello, everyone.

I've a string that looks something like

lksjdfls kdjff lsdfs sdjfls sdfsdwelcome

From the above string I need the digits within the ID attribute. For
example, the required output from the above string is
- 35343433
- 345343
- 8898

I've written this regex that's kind of working
re.findall("\w+\s*\W+amazon_(\d+)",str)

but I was just wondering whether there might be a better RegEx to do
the same thing. Can you kindly suggest a better/improved Regex? Thank
you in advance.
Re: Regex help needed
rh0dium wrote:
> Michael Spencer wrote:
>> >>> def parse(source):
>> ...     source = source.splitlines()
>> ...     original, rest = source[0], "\n".join(source[1:])
>> ...     return original, rest_eval(get_tokens(rest))
>
> This is a very clean and elegant way to separate them - Very nice!! I
> like this a lot - I will definitely use this in the future!!
>
>> Cheers
>>
>> Michael
>

On reflection, this simplifies further (to 9 lines), at least for the test
cases you provide, which don't involve any nested parens:

>>> import cStringIO, tokenize
...
>>> def get_tokens2(source):
...     src = cStringIO.StringIO(source).readline
...     src = tokenize.generate_tokens(src)
...     return [token[1][1:-1] for token in src if token[0] == tokenize.STRING]
...
>>> def parse2(source):
...     source = source.splitlines()
...     original, rest = source[0], "\n".join(source[1:])
...     return original, get_tokens2(rest)
...
>>>

This matches your main function for the three tests where main works...

>>> for source in sources[:3]: #matches your main function where it works
...     assert parse2(source) == main(source)
...
Original someFunction
Orig someFunction Results ['test', 'foo']
Original someFunction
Orig someFunction Results ['test foo']
Original someFunction
Orig someFunction Results ['test', 'test1', 'foo aasdfasdf', 'newline', 'test2']

...and handles the case where main fails (I think correctly, although I'm not
entirely sure what your desired output is in this case):

>>> parse2(sources[3])
('getVersion()', ['@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $'])
>>>

If you really do need nested parens, then you'd need the slightly longer
version I posted earlier.

Cheers

Michael
Re: Regex help needed
"rh0dium" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > > Paul McGuire wrote: > > > ident = Combine( Word(alpha,alphanums+"_") + LPAR + RPAR ) > > This will only work for a word with a parentheses ( ie. somefunction() > ) > > > If you *really* want everything on the first line to be the ident, try this: > > > > ident = Word(alpha,alphanums+"_") + restOfLine > > or > > ident = Combine( Word(alpha,alphanums+"_") + restOfLine ) > > This nicely grabs the "\r".. How can I get around it? > > > Now the next step is to assign field names to the results: > > > > dataFormat = ident.setResultsName("ident") + ( dblQuotedString | > > quoteList ).setResultsName("contents") > > This is super cool!! > > So let's take this for example > > test= 'fprintf( outFile "leSetInstSelectable( t )\n" )\r\n ("test" > "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n' > > Now I want the ident to pull out 'fprintf( outFile > "leSetInstSelectable( t )\n" )' so I tried to do this? > > ident = Forward() > ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore( > dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR) > > Borrowing from the example listed previously. But it bombs out cause > it wants a ")" but it has one.. Forward() ROCKS!! > > Also how does it know to do this for just the first line? It would > seem that this will work for every line - No? > This works for me: test4 = r"""fprintf( outFile "leSetInstSelectable( t )\n" ) ("test" "test1" "foo aasdfasdf" "newline" "test2") """ ident = Forward() ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore( dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR) dataFormat = ident + ( dblQuotedString | quoteList ) print dataFormat.parseString(test4) Prints: [['fprintf', '(', 'outFile', '"leSetInstSelectable( t )\\n"', ')'], ['"test"', '"test1"', '"foo aasdfasdf"', '"newline"', '"test2"']] 1. Is there supposed to be a real line break in the string "leSetInstSelectable( t )\n", or just a slash-n at the end? 
pyparsing quoted strings do not accept multiline quotes, but they do accept escaped characters such as "\t" "\n", etc. That is, to pyparsing: "\n this is a valid \t \n string" "this is not a valid string" Part of the confusion is that your examples include explicit \r\n characters. I'm assuming this is to reflect what you see when listing out the Python variable containing the string. (Are you opening a text file with "rb" to read in binary? Try opening with just "r", and this may resolve your \r\n problems.) 2. If restOfLine is still giving you \r's at the end, you can redefine restOfLine to not include them, or to include and suppress them. Or (this is easier) define a parse action for restOfLine that strips trailing \r's: def stripTrailingCRs(st,loc,toks): try: if toks[0][-1] == '\r': return toks[0][:-1] except: pass restOfLine.setParseAction( stripTrailingCRs ) 3. How does it know to only do it for the first line? Presumably you told it to do so. pyparsing's parseString method starts at the beginning of the input string, and matches expressions until it finds a mismatch, or runs out of expressions to match - even if there is more input string to process, pyparsing does not continue. To search through the whole file looking for idents, try using scanString which returns a generator; for each match, the generator gives a tuple containing: - tokens - the matched tokens - start - the start location of the match - end - the end location of the match If your input file consists *only* of these constructs, you can also just expand dataFormat.parseString to OneOrMore(dataFormat).parseString. -- Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed
Michael Spencer wrote:
> >>> def parse(source):
> ...     source = source.splitlines()
> ...     original, rest = source[0], "\n".join(source[1:])
> ...     return original, rest_eval(get_tokens(rest))

This is a very clean and elegant way to separate them - Very nice!! I
like this a lot - I will definitely use this in the future!!

>
> Cheers
>
> Michael
Re: Regex help needed
Paul McGuire wrote:

> ident = Combine( Word(alphas,alphanums+"_") + LPAR + RPAR )

This will only work for a word followed by parentheses (i.e.
somefunction() )

> If you *really* want everything on the first line to be the ident, try this:
>
> ident = Word(alphas,alphanums+"_") + restOfLine
> or
> ident = Combine( Word(alphas,alphanums+"_") + restOfLine )

This nicely grabs the "\r".. How can I get around it?

> Now the next step is to assign field names to the results:
>
> dataFormat = ident.setResultsName("ident") + ( dblQuotedString |
> quoteList ).setResultsName("contents")

This is super cool!!

So let's take this for example

test= 'fprintf( outFile "leSetInstSelectable( t )\n" )\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'

Now I want the ident to pull out 'fprintf( outFile
"leSetInstSelectable( t )\n" )' so I tried to do this?

ident = Forward()
ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore(
    dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)

Borrowing from the example listed previously. But it bombs out because
it wants a ")" even though it has one.. Forward() ROCKS!!

Also how does it know to do this for just the first line? It would
seem that this will work for every line - No?
Re: Regex help needed
rh0dium wrote:
> Hi all,
>
> I am using python to drive another tool using pexpect. The values
> which I get back I would like to automatically put into a list if there
> is more than one return value. They provide me a way to see that the
> data is in set by parenthesising it.
>
...
>
> CAN SOMEONE PLEASE CLEAN THIS UP?
>

How about using the Python tokenizer rather than re:

>>> import cStringIO, tokenize
...
>>> def get_tokens(source):
...     allowed_tokens = (tokenize.STRING, tokenize.OP)
...     src = cStringIO.StringIO(source).readline
...     src = tokenize.generate_tokens(src)
...     return (token[1] for token in src if token[0] in allowed_tokens)
...
>>> def rest_eval(tokens):
...     output = []
...     for token in tokens:
...         if token == "(":
...             output.append(rest_eval(tokens))
...         elif token == ")":
...             return output
...         else:
...             output.append(token[1:-1])
...     return output
...
>>> def parse(source):
...     source = source.splitlines()
...     original, rest = source[0], "\n".join(source[1:])
...     return original, rest_eval(get_tokens(rest))
...
>>> sources = [
...     'someFunction\r\n "test" "foo"\r\n',
...     'someFunction\r\n "test foo"\r\n',
...     'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n',
...     'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n']
>>>
>>> for data in sources: parse(data)
...
('someFunction', ['test', 'foo'])
('someFunction', ['test foo'])
('getVersion()', ['@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $'])
('someFunction', [['test', 'test1', 'foo aasdfasdf', 'newline', 'test2']])
>>>

Cheers

Michael
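The session above is Python 2 (`cStringIO` no longer exists). A rough Python 3 port of the same tokenizer idea might look like the sketch below. It keeps the function names from the message, but filtering the operators down to parentheses and stripping each line's leading white space before tokenizing are assumptions of this sketch rather than part of the original.

```python
import io
import tokenize


def get_tokens(source):
    """Yield only the string literals and parentheses found in source."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.STRING:
            yield tok.string
        elif tok.type == tokenize.OP and tok.string in ("(", ")"):
            yield tok.string


def rest_eval(tokens):
    """Build nested lists from the token stream; tokens must be an iterator."""
    output = []
    for token in tokens:
        if token == "(":
            output.append(rest_eval(tokens))  # recurse into the sub-list
        elif token == ")":
            return output
        else:
            output.append(token[1:-1])  # drop the surrounding quotes
    return output


def parse(source):
    lines = source.splitlines()
    original = lines[0]
    # Leading white space is stripped so the tokenizer never has to deal
    # with indentation (an assumption made for this sketch).
    rest = "\n".join(line.strip() for line in lines[1:])
    return original, rest_eval(get_tokens(rest))


print(parse('someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'))
```

As in the original, a parenthesised reply comes back as a list nested inside the result list, while a bare quoted value is a flat one-element list.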
Re: Regex help needed
"rh0dium" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > > Paul McGuire wrote: > > -- Paul > > (Download pyparsing at http://pyparsing.sourceforge.net.) > > Done. > > > Hey this is pretty cool! I have one small problem that I don't know > how to resolve. I want the entire contents (whatever it is) of line 1 > to be the ident. Now digging into the code showed a method line, > lineno and LineStart LineEnd. I tried to use all three but it didn't > work for a few reasons ( line = type issues, lineno - I needed the data > and could't get it to work, LineStart/End - I think it matches every > line and I need the scope to line 1 ) > > So here is my rendition of the code - But this is REALLY slick.. > > I think the problem is the parens on line one > > def main(data=None): > > LPAR = Literal("(") > RPAR = Literal(")") > > # assume function identifiers must start with alphas, followed by > zero or more > # alphas, numbers, or '_' - expand this defn as needed > ident = LineStart + LineEnd > > # define a list as one or more quoted strings, inside ()'s - we'll > tackle nesting > # in a minute > quoteList = Group( LPAR.suppress() + OneOrMore(dblQuotedString) + > RPAR.suppress()) > > # define format of a line of data - don't bother with \n's or \r's, > > # pyparsing just skips 'em > dataFormat = ident + ( dblQuotedString | quoteList ) > > return dataFormat.parseString(data) > > > # General run.. > if __name__ == '__main__': > > > # data = 'someFunction\r\n "test" "foo"\r\n' > # data = 'someFunction\r\n "test foo"\r\n' > data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 > 05/22/2005 23:36 (cicln01) $"\r\n' > # data = 'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n > "newline" "test2")\r\n' > > foo = main(data) > > print foo > LineStart() + LineEnd() will only match an empty line. If you describe in words what you want ident to be, it may be more natural to translate to pyparsing. 
"A word starting with an alpha, followed by zero or more alphas, numbers, or '_'s, with a trailing pair of parens" ident = Word(alpha,alphanums+"_") + LPAR + RPAR If you want the ident all combined into a single token, use: ident = Combine( Word(alpha,alphanums+"_") + LPAR + RPAR ) LineStart and LineEnd are geared more for line-oriented or whitespace-sensitive grammars. Your example doesn't really need them, I don't think. If you *really* want everything on the first line to be the ident, try this: ident = Word(alpha,alphanums+"_") + restOfLine or ident = Combine( Word(alpha,alphanums+"_") + restOfLine ) Now the next step is to assign field names to the results: dataFormat = ident.setResultsName("ident") + ( dblQuotedString | quoteList ).setResultsName("contents") test = "blah blah test string" results = dataFormat.parseString(test) print results.ident, results.contents I'm glad pyparsing is working out for you! There should be a number of examples that ship with pyparsing that may give you some more ideas on how to proceed from here. -- Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: Regex help needed
Paul McGuire wrote:
> -- Paul
> (Download pyparsing at http://pyparsing.sourceforge.net.)

Done.

Hey this is pretty cool! I have one small problem that I don't know
how to resolve. I want the entire contents (whatever it is) of line 1
to be the ident. Now digging into the code showed a method line,
lineno and LineStart LineEnd. I tried to use all three but it didn't
work for a few reasons ( line = type issues, lineno - I needed the data
and could't get it to work, LineStart/End - I think it matches every
line and I need the scope to line 1 )

So here is my rendition of the code - But this is REALLY slick..

I think the problem is the parens on line one

def main(data=None):

    LPAR = Literal("(")
    RPAR = Literal(")")

    # assume function identifiers must start with alphas, followed by
    # zero or more alphas, numbers, or '_' - expand this defn as needed
    ident = LineStart + LineEnd

    # define a list as one or more quoted strings, inside ()'s - we'll
    # tackle nesting in a minute
    quoteList = Group( LPAR.suppress() + OneOrMore(dblQuotedString) +
                       RPAR.suppress())

    # define format of a line of data - don't bother with \n's or \r's,
    # pyparsing just skips 'em
    dataFormat = ident + ( dblQuotedString | quoteList )

    return dataFormat.parseString(data)


# General run..
if __name__ == '__main__':

    # data = 'someFunction\r\n "test" "foo"\r\n'
    # data = 'someFunction\r\n "test foo"\r\n'
    data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'
    # data = 'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'

    foo = main(data)

    print foo
Re: Regex help needed
"rh0dium" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Hi all, > > I am using python to drive another tool using pexpect. The values > which I get back I would like to automatically put into a list if there > is more than one return value. They provide me a way to see that the > data is in set by parenthesising it. > Well, you asked for regex help, but a pyparsing rendition may be easier to read and maintain. -- Paul (Download pyparsing at http://pyparsing.sourceforge.net.) # test data strings test1 = """somefunction() "@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $" """ test2 = """somefunction() ("." "~" "/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile" "foo") """ test3 = """somefunctionWithNestedlist() ("." "~" "/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile" ("Hey!" "this is a nested" "list") "foo") """ """ So if you're still reading this I want to parse out data. Here are the rules... - Line 1 ALWAYS is the calling function whatever is there (except "\r\n") should be kept as "original" - Anything may occur inside the quotations - I don't care what's in there per se but it must be maintained. - Parenthesed items I want to be pushed into a list. I haven't run into a case where you have nested paren's but that not to say it won't happen... 
""" from pyparsing import Literal, Word, alphas, alphanums, \ dblQuotedString, OneOrMore, Group, Forward LPAR = Literal("(") RPAR = Literal(")") # assume function identifiers must start with alphas, followed by zero or more # alphas, numbers, or '_' - expand this defn as needed ident = Word(alphas,alphanums+"_") # define a list as one or more quoted strings, inside ()'s - we'll tackle nesting # in a minute quoteList = Group( LPAR.suppress() + OneOrMore(dblQuotedString) + RPAR.suppress() ) # define format of a line of data - don't bother with \n's or \r's, # pyparsing just skips 'em dataFormat = ident + LPAR + RPAR + ( dblQuotedString | quoteList ) def test(t): print dataFormat.parseString(t) print "Parse flat lists" test(test1) test(test2) # modifications for nested lists quoteList = Forward() quoteList << Group( LPAR.suppress() + OneOrMore(dblQuotedString | quoteList) + RPAR.suppress() ) dataFormat = ident + LPAR + RPAR + ( dblQuotedString | quoteList ) print print "Parse using nested lists" test(test1) test(test2) test(test3) Parsing results: Parse flat lists ['somefunction', '(', ')', '"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"'] ['somefunction', '(', ')', ['"."', '"~"', '"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"', '"foo"']] Parse using nested lists ['somefunction', '(', ')', '"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"'] ['somefunction', '(', ')', ['"."', '"~"', '"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"', '"foo"']] ['somefunctionWithNestedlist', '(', ')', ['"."', '"~"', '"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"', ['"Hey!"', '"this is a nested"', '"list"'], '"foo"']] -- http://mail.python.org/mailman/listinfo/python-list
Regex help needed
Hi all,

I am using python to drive another tool using pexpect. The values
which I get back I would like to automatically put into a list if there
is more than one return value. They provide me a way to see that the
data is in a set by parenthesising it.

This is all generated as I said using pexpect - Here is how I use it..

child = pexpect.spawn( _buildCadenceExe(), timeout=timeout)
child.sendline("somefunction()")
child.expect("> ")
data=child.before

Given this data can take on several shapes:

Single return value -- THIS IS THE ONE I CAN'T GET TO WORK..

data = 'somefunction()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'

Multiple return value

data = 'somefunction()\r\n("." "~" "/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile")\r\n'

It may take up several lines...

data = 'somefunction()\r\n("." "~" \r\n"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"\r\n"foo")\r\n'

So if you're still reading this I want to parse out data. Here are the
rules...
- Line 1 ALWAYS is the calling function; whatever is there (except
"\r\n") should be kept as "original"
- Anything may occur inside the quotations - I don't care what's in
there per se but it must be maintained.
- Parenthesized items I want to be pushed into a list. I haven't run
into a case where you have nested parens but that's not to say it won't
happen...

So here is my code.. Pardon my hack job..

import os,re

def main(data=None):

    # Get rid of the annoying \r's
    dat=data.split("\r")
    data="".join(dat)

    # Remove the first line - that is the original call
    dat = data.split("\n")
    original=dat[0]
    del dat[0]
    print "Original", original

    # Now join all of the remaining lines
    retl="".join(dat)

    # self.logger.debug("Original = \'%s\'" % original)

    try:
        # Get rid of the parenthesis
        parmatcher = re.compile( r'\(([^()]*)\)' )
        parmatch = parmatcher.search(retl)

        # Get rid of the first and last quotes
        qrmatcher = re.compile( r'\"([^()]*)\"' )
        qrmatch = qrmatcher.search(parmatch.group(1))

        # Split the items
        qmatch=re.compile(r'\"\s+\"')
        results = qmatch.split(qrmatch.group(1))
    except:
        qrmatcher = re.compile( r'\"([^()]*)\"' )
        qrmatch = qrmatcher.search(retl)

        # Split the items
        qmatch=re.compile(r'\"\s+\"')
        results = qmatch.split(qrmatch.group(1))

    print "Orig", original, "Results", results
    return original,results


# General run..
if __name__ == '__main__':

    # data = 'someFunction\r\n "test" "foo"\r\n'
    # data = 'someFunction\r\n "test foo"\r\n'
    data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'
    # data = 'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'

    main(data)

CAN SOMEONE PLEASE CLEAN THIS UP?
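For the flat (non-nested) cases listed in the rules above, one possible tidy-up is sketched below in Python 3. It is not the approach any of the replies took: it assumes a quoted value never contains an escaped double quote, and `parse_reply` is a hypothetical name. A single pattern picks out every quoted item, so the single-value reply containing parentheses needs no special branch.

```python
import re

# Everything between a pair of double quotes. Assumes no escaped quotes
# inside a value (an assumption of this sketch).
QUOTED = re.compile(r'"([^"]*)"')


def parse_reply(data):
    """Split tool output into (original call, list of quoted values)."""
    lines = data.replace("\r", "").split("\n")
    original = lines[0]  # line 1 is always the calling function
    rest = "\n".join(lines[1:])
    return original, QUOTED.findall(rest)


single = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'
several = ('somefunction()\r\n("." "~" \r\n'
           '"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"\r\n"foo")\r\n')

print(parse_reply(single))
print(parse_reply(several))
```

Parentheses are ignored entirely here, so nested lists would be flattened; the tokenizer and pyparsing answers in this thread are the ones to reach for if nesting has to be preserved.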