Re: Regex help needed!

2010-01-07 Thread Rolando Espinoza La Fuente
# http://gist.github.com/271661

import lxml.html
import re

src = """
lksjdfls  kdjff lsdfs  sdjfls sdfsdwelcome
hello, my age is 86 years old and I was born in 1945. Do you know
that
PI is roughly 3.1443534534534534534 """

regex = re.compile(r'amazon_(\d+)')

doc = lxml.html.document_fromstring(src)

for div in doc.xpath('//div[starts-with(@id, "amazon_")]'):
    match = regex.match(div.get('id'))
    if match:
        print match.groups()[0]



On Thu, Jan 7, 2010 at 4:42 PM, Aahz  wrote:
> In article 
> <19de1d6e-5ba9-42b5-9221-ed7246e39...@u36g2000prn.googlegroups.com>,
> Oltmans   wrote:
>>
>>I've written this regex that's kind of working
>>re.findall("\w+\s*\W+amazon_(\d+)",str)
>>
>>but I was just wondering that there might be a better RegEx to do that
>>same thing. Can you kindly suggest a better/improved Regex. Thank you
>>in advance.
>
> 'Some people, when confronted with a problem, think "I know, I'll use
> regular expressions."  Now they have two problems.'
> --Jamie Zawinski
>
> Take the advice other people gave you and use BeautifulSoup.
> --
> Aahz (a...@pythoncraft.com)           <*>         http://www.pythoncraft.com/
>
> "If you think it's expensive to hire a professional to do the job, wait
> until you hire an amateur."  --Red Adair
> --
> http://mail.python.org/mailman/listinfo/python-list
>



-- 
Rolando Espinoza La fuente
www.rolandoespinoza.info
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex help needed!

2010-01-07 Thread Aahz
In article <19de1d6e-5ba9-42b5-9221-ed7246e39...@u36g2000prn.googlegroups.com>,
Oltmans   wrote:
>
>I've written this regex that's kind of working
>re.findall("\w+\s*\W+amazon_(\d+)",str)
>
>but I was just wondering that there might be a better RegEx to do that
>same thing. Can you kindly suggest a better/improved Regex. Thank you
>in advance.

'Some people, when confronted with a problem, think "I know, I'll use
regular expressions."  Now they have two problems.'
--Jamie Zawinski

Take the advice other people gave you and use BeautifulSoup.
-- 
Aahz (a...@pythoncraft.com)   <*> http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur."  --Red Adair


Re: Regex help needed!

2009-12-24 Thread F.R.



On 21.12.2009 12:38, Oltmans wrote:

Hello,. everyone.

I've a string that looks something like

lksjdfls  kdjff lsdfs  sdjflssdfsdwelcome


> From above string I need the digits within the ID attribute. For
example, required output from above string is
- 35343433
- 345343
- 8898

I've written this regex that's kind of working
re.findall("\w+\s*\W+amazon_(\d+)",str)

but I was just wondering that there might be a better RegEx to do that
same thing. Can you kindly suggest a better/improved Regex. Thank you
in advance.
   


If you filter in two or even more sequential steps the problem becomes a 
lot simpler, not least because you can

test each step separately:

>>> r1 = re.compile (']*')   # Add ignore case and variable white space
>>> r2 = re.compile ('\d+')
>>> [r2.search (item).group () for item in r1.findall (s) if item]   # s is your sample
['345343', '35343433', '8898'] # Supposing all ids have digits

Frederic



Re: Regex help needed!

2009-12-22 Thread Paul McGuire
On Dec 21, 5:38 am, Oltmans  wrote:
> Hello,. everyone.
>
> I've a string that looks something like
> 
> lksjdfls  kdjff lsdfs  sdjfls  =   "amazon_35343433">sdfsdwelcome
> 
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>

The issue with using regexen for parsing HTML is that you often get
surprised by attributes that you never expected, or that are out of order,
or that have weird or missing quotation marks, or by tags or attributes
that are in upper/lower case.  BeautifulSoup is one tool to use for HTML
scraping; here is a pyparsing example, with hopefully descriptive
comments:


from pyparsing import makeHTMLTags,ParseException

src = """
lksjdfls  kdjff lsdfs  sdjfls sdfsdwelcome
hello, my age is 86 years old and I was born in 1945. Do you know
that
PI is roughly 3.1443534534534534534 """

# use makeHTMLTags to return an expression that will match
# HTML <div> tags, including attributes, upper/lower case,
# etc. (makeHTMLTags will return expressions for both
# opening and closing tags, but we only care about the
# opening one, so just use the [0]th returned item)
div = makeHTMLTags("div")[0]

# define a parse action to filter only for <div> tags
# with the proper id form
def filterByIdStartingWithAmazon(tokens):
    if not tokens.id.startswith("amazon_"):
        raise ParseException(
          "must have id attribute starting with 'amazon_'")

# define a parse action that will add a pseudo-
# attribute 'amazon_id', to make it easier to get the
# numeric portion of the id after the leading 'amazon_'
def makeAmazonIdAttribute(tokens):
    tokens["amazon_id"] = tokens.id[len("amazon_"):]

# attach parse action callbacks to the div expression -
# these will be called during parse time
div.setParseAction(filterByIdStartingWithAmazon,
                   makeAmazonIdAttribute)

# search through the input string for matching <div>s,
# and print out their amazon_ids
for divtag in div.searchString(src):
    print divtag.amazon_id


Prints:

345343
35343433
8898



Re: Regex help needed!

2009-12-22 Thread Umakanth
how about re.findall(r'\w+.=\W\D+(\d+)?',str) ?

this will work for any string within id !

~Ukanth

On Dec 21, 6:06 pm, Oltmans  wrote:
> On Dec 21, 5:05 pm, Umakanth  wrote:
>
> > How about re.findall(r'\d+(?:\.\d+)?',str)
>
> > extracts only numbers from any string
>
> Thank you. However, I only need the digits within the ID attribute of
> the DIV. Regex that you suggested fails on the following string
>
> 
> lksjdfls  kdjff lsdfs  sdjfls  =   "amazon_35343433">sdfsdwelcome
> hello, my age is 86 years old and I was born in 1945. Do you know that
> PI is roughly 3.1443534534534534534
> 
>
> > ~uk
>
> > On Dec 21, 4:38 pm, Oltmans  wrote:
>
> > > Hello,. everyone.
>
> > > I've a string that looks something like
> > > 
> > > lksjdfls  kdjff lsdfs  sdjfls  > > =   "amazon_35343433">sdfsdwelcome
> > > 
>
> > > From above string I need the digits within the ID attribute. For
> > > example, required output from above string is
> > > - 35343433
> > > - 345343
> > > - 8898
>
> > > I've written this regex that's kind of working
> > > re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> > > but I was just wondering that there might be a better RegEx to do that
> > > same thing. Can you kindly suggest a better/improved Regex. Thank you
> > > in advance.
>
>



Re: Regex help needed!

2009-12-21 Thread Johann Spies
> Oltmans wrote:
> >I've a string that looks something like
> >
> >lksjdfls  kdjff lsdfs  sdjfls  >=   "amazon_35343433">sdfsdwelcome
> >
> >
> >>From above string I need the digits within the ID attribute. For
> >example, required output from above string is
> >- 35343433
> >- 345343
> >- 8898
> >

Your string is in /tmp/y in this example:

$ grep -oE '[0-9]+' /tmp/y
345343
35343433
8898

Much simpler, isn't it?  But that is not python.
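
A rough Python counterpart of the grep one-liner, offered only as a sketch. Like the grep call, it assumes the only digits in the text are the ones wanted. The sample text is made up for illustration, since the markup of the original string did not survive in this archive.

```python
import re

# Hypothetical stand-in for the contents of /tmp/y.
text = "lksjdfls id='amazon_345343' kdjff id=\"amazon_35343433\" sdfsd id='amazon_8898' welcome"

# Same idea as grep -o with [0-9]+: every maximal run of digits.
digits = re.findall(r"[0-9]+", text)
print(digits)
```

As soon as the text contains other numbers (an age, a year), this picks those up too, which is the objection raised elsewhere in the thread.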

Regards
Johann

-- 
Johann Spies  Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

 "And there were in the same country shepherds abiding 
  in the field, keeping watch over their flock by night.
  And, lo, the angel of the Lord came upon them, and the
  glory of the Lord shone round about them: and they were 
  sore afraid. And the angel said unto them, Fear not:
  for behold I bring you good tidings of great joy, which
  shall be to all people. For unto you is born this day 
  in the city of David a Saviour, which is Christ the 
  Lord."Luke 2:8-11 


Re: Regex help needed!

2009-12-21 Thread MRAB

Oltmans wrote:

Hello,. everyone.

I've a string that looks something like

lksjdfls  kdjff lsdfs  sdjfls sdfsdwelcome



From above string I need the digits within the ID attribute. For

example, required output from above string is
- 35343433
- 345343
- 8898

I've written this regex that's kind of working
re.findall("\w+\s*\W+amazon_(\d+)",str)

but I was just wondering that there might be a better RegEx to do that
same thing. Can you kindly suggest a better/improved Regex. Thank you
in advance.


Try:

re.findall(r"", str)

You shouldn't be using 'str' as a variable name because it hides the
builtin string class 'str'.
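
As an illustration of both points (a pattern tied to the div's id attribute, and a variable name that does not shadow the builtin), here is a minimal sketch. The sample markup and the names `html` and `AMAZON_ID` are assumptions made for this example; the pattern only handles the case where id is the first attribute of the div.

```python
import re

# Hypothetical sample: the original markup was lost in the archive.
html = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
=   "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
hello, my age is 86 years old and I was born in 1945."""

# "<div", whitespace, "id", optional whitespace around "=", a quote,
# then the amazon_ prefix followed by the captured digits.
AMAZON_ID = re.compile(r"""<div\s+id\s*=\s*['"]amazon_(\d+)['"]""", re.IGNORECASE)

ids = AMAZON_ID.findall(html)
print(ids)
```

The age and the year in the last line are ignored because they are not preceded by the div/id/amazon_ context.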


Re: Regex help needed!

2009-12-21 Thread Umakanth
Ok. how about re.findall(r'\w+_(\d+)',str) ?

returns ['345343', '35343433', '8898', '8898'] !

On Dec 21, 6:06 pm, Oltmans  wrote:
> On Dec 21, 5:05 pm, Umakanth  wrote:
>
> > How about re.findall(r'\d+(?:\.\d+)?',str)
>
> > extracts only numbers from any string
>
> Thank you. However, I only need the digits within the ID attribute of
> the DIV. Regex that you suggested fails on the following string
>
> 
> lksjdfls  kdjff lsdfs  sdjfls  =   "amazon_35343433">sdfsdwelcome
> hello, my age is 86 years old and I was born in 1945. Do you know that
> PI is roughly 3.1443534534534534534
> 
>
> > ~uk
>
> > On Dec 21, 4:38 pm, Oltmans  wrote:
>
> > > Hello,. everyone.
>
> > > I've a string that looks something like
> > > 
> > > lksjdfls  kdjff lsdfs  sdjfls  > > =   "amazon_35343433">sdfsdwelcome
> > > 
>
> > > From above string I need the digits within the ID attribute. For
> > > example, required output from above string is
> > > - 35343433
> > > - 345343
> > > - 8898
>
> > > I've written this regex that's kind of working
> > > re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> > > but I was just wondering that there might be a better RegEx to do that
> > > same thing. Can you kindly suggest a better/improved Regex. Thank you
> > > in advance.
>
>



Re: Regex help needed!

2009-12-21 Thread Oltmans
On Dec 21, 5:05 pm, Umakanth  wrote:
> How about re.findall(r'\d+(?:\.\d+)?',str)
>
> extracts only numbers from any string
>

Thank you. However, I only need the digits within the ID attribute of
the DIV. Regex that you suggested fails on the following string


lksjdfls  kdjff lsdfs  sdjfls sdfsdwelcome
hello, my age is 86 years old and I was born in 1945. Do you know that
PI is roughly 3.1443534534534534534
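
To make the failure concrete, a small sketch comparing the suggested pattern with one that requires the amazon_ prefix. The text below is a shortened, made-up stand-in for the string above.

```python
import re

# Hypothetical, shortened stand-in for the failing string.
text = 'x id="amazon_8898" my age is 86, born in 1945, PI is roughly 3.14'

# The suggested pattern matches every number in the text.
all_numbers = re.findall(r"\d+(?:\.\d+)?", text)

# Requiring the prefix keeps only the digits that belong to an id.
amazon_ids = re.findall(r"amazon_(\d+)", text)

print(all_numbers)
print(amazon_ids)
```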





> ~uk
>
> On Dec 21, 4:38 pm, Oltmans  wrote:
>
> > Hello,. everyone.
>
> > I've a string that looks something like
> > 
> > lksjdfls  kdjff lsdfs  sdjfls  > =   "amazon_35343433">sdfsdwelcome
> > 
>
> > From above string I need the digits within the ID attribute. For
> > example, required output from above string is
> > - 35343433
> > - 345343
> > - 8898
>
> > I've written this regex that's kind of working
> > re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> > but I was just wondering that there might be a better RegEx to do that
> > same thing. Can you kindly suggest a better/improved Regex. Thank you
> > in advance.
>
>



Re: Regex help needed!

2009-12-21 Thread Peter Otten
Oltmans wrote:

> I've a string that looks something like
> 
> lksjdfls  kdjff lsdfs  sdjfls  =   "amazon_35343433">sdfsdwelcome
> 
> 
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
> 
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
> 
> but I was just wondering that there might be a better RegEx to do that
> same thing. Can you kindly suggest a better/improved Regex. Thank you
> in advance.

>>> from BeautifulSoup import BeautifulSoup
>>> bs = BeautifulSoup("""lksjdfls  kdjff lsdfs 
 sdjfls sdfsdwelcome""")
>>> [node["id"][7:] for node in bs(id=lambda id: id.startswith("amazon_"))]
[u'345343', u'35343433', u'8898']

I think BeautifulSoup is a better tool for the task since it actually 
"understands" HTML.

Peter
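
For readers without BeautifulSoup installed, the same "let an HTML parser read the attributes" idea can be sketched with the standard library's html.parser instead. This is a substitute technique, not the BeautifulSoup API. The class name `AmazonIdCollector` and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser


class AmazonIdCollector(HTMLParser):
    """Collect the numeric part of every <div id="amazon_..."> start tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        if tag != "div":
            return
        for name, value in attrs:
            if name == "id" and value and value.startswith("amazon_"):
                self.ids.append(value[len("amazon_"):])


# Hypothetical sample: the original markup was lost in the archive.
html = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
=   "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""

collector = AmazonIdCollector()
collector.feed(html)
print(collector.ids)
```

The parser takes care of quoting style and whitespace around the equals sign, so the code only has to look at the attribute value.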


Re: Regex help needed!

2009-12-21 Thread mik3
On Dec 21, 7:38 pm, Oltmans  wrote:
> Hello,. everyone.
>
> I've a string that looks something like
> 
> lksjdfls  kdjff lsdfs  sdjfls  =   "amazon_35343433">sdfsdwelcome
> 
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> but I was just wondering that there might be a better RegEx to do that
> same thing. Can you kindly suggest a better/improved Regex. Thank you
> in advance.

don't need regular expression. just do a split on amazon

>>> s="""lksjdfls  kdjff lsdfs  sdjfls >> =   "amazon_35343433">sdfsdwelcome"""

>>> for item in s.split("amazon_")[1:]:
...   print item
...
345343'> kdjff lsdfs  sdjfls sdfsdwelcome

then find  ' or " indices and do index  slicing.
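
The split-then-slice idea might be finished along these lines. This is only a sketch: the sample string is a made-up reconstruction, and it assumes every id value is closed by a single or double quote.

```python
# Hypothetical sample: the original markup was lost in the archive.
s = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
=   "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""

ids = []
for item in s.split("amazon_")[1:]:
    # The id value ends at the first closing quote, single or double.
    ends = [i for i in (item.find("'"), item.find('"')) if i != -1]
    if ends:
        ids.append(item[:min(ends)])

print(ids)
```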


Re: Regex help needed!

2009-12-21 Thread Umakanth
How about re.findall(r'\d+(?:\.\d+)?',str)

extracts only numbers from any string

~uk

On Dec 21, 4:38 pm, Oltmans  wrote:
> Hello,. everyone.
>
> I've a string that looks something like
> 
> lksjdfls  kdjff lsdfs  sdjfls  =   "amazon_35343433">sdfsdwelcome
> 
>
> From above string I need the digits within the ID attribute. For
> example, required output from above string is
> - 35343433
> - 345343
> - 8898
>
> I've written this regex that's kind of working
> re.findall("\w+\s*\W+amazon_(\d+)",str)
>
> but I was just wondering that there might be a better RegEx to do that
> same thing. Can you kindly suggest a better/improved Regex. Thank you
> in advance.



Regex help needed!

2009-12-21 Thread Oltmans
Hello, everyone.

I've a string that looks something like

lksjdfls  kdjff lsdfs  sdjfls sdfsdwelcome


From above string I need the digits within the ID attribute. For
example, required output from above string is
- 35343433
- 345343
- 8898

I've written this regex that's kind of working
re.findall("\w+\s*\W+amazon_(\d+)",str)

but I was just wondering that there might be a better RegEx to do that
same thing. Can you kindly suggest a better/improved Regex. Thank you
in advance.


Re: Regex help needed

2006-01-10 Thread Michael Spencer
rh0dium wrote:
> Michael Spencer wrote:
>>   >>> def parse(source):
>>   ...     source = source.splitlines()
>>   ...     original, rest = source[0], "\n".join(source[1:])
>>   ...     return original, rest_eval(get_tokens(rest))
> 
> This is a very clean and elegant way to separate them - Very nice!!  I
> like this a lot - I will definitely use this in the future!!
> 
>> Cheers
>>
>> Michael
> 
On reflection, this simplifies further (to 9 lines), at least for the test
cases you provide, which don't involve any nested parens:

  >>> import cStringIO, tokenize
  ...
  >>> def get_tokens2(source):
  ...     src = cStringIO.StringIO(source).readline
  ...     src = tokenize.generate_tokens(src)
  ...     return [token[1][1:-1] for token in src if token[0] == tokenize.STRING]
  ...
  >>> def parse2(source):
  ...     source = source.splitlines()
  ...     original, rest = source[0], "\n".join(source[1:])
  ...     return original, get_tokens2(rest)
  ...
  >>>

This matches your main function for the three tests where main works...

  >>> for source in sources[:3]: #matches your main function where it works
  ...     assert parse2(source) == main(source)
  ...
  Original someFunction
  Orig someFunction Results ['test', 'foo']
  Original someFunction
  Orig someFunction Results ['test  foo']
  Original someFunction
  Orig someFunction Results ['test', 'test1', 'foo aasdfasdf', 'newline', 'test2']

...and handles the case where main fails (I think correctly, although I'm not
entirely sure what your desired output is in this case):
  >>> parse2(sources[3])
  ('getVersion()', ['@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $'])
  >>>

If you really do need nested parens, then you'd need the slightly longer
version I posted earlier.

Cheers

Michael



Re: Regex help needed

2006-01-10 Thread Paul McGuire
"rh0dium" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
>
> Paul McGuire wrote:
>
> > ident = Combine( Word(alpha,alphanums+"_") + LPAR + RPAR )
>
> This will only work for a word with a parentheses ( ie.  somefunction()
> )
>
> > If you *really* want everything on the first line to be the ident, try
this:
> >
> > ident = Word(alpha,alphanums+"_") + restOfLine
> > or
> > ident = Combine( Word(alpha,alphanums+"_") + restOfLine )
>
> This nicely grabs the "\r"..  How can I get around it?
>
> > Now the next step is to assign field names to the results:
> >
> > dataFormat = ident.setResultsName("ident") + ( dblQuotedString |
> > quoteList ).setResultsName("contents")
>
> This is super cool!!
>
> So let's take this for example
>
> test= 'fprintf( outFile "leSetInstSelectable( t )\n" )\r\n ("test"
> "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'
>
> Now I want the ident to pull out 'fprintf( outFile
> "leSetInstSelectable( t )\n" )' so I tried to do this?
>
> ident = Forward()
> ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore(
> dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)
>
> Borrowing from the example listed previously.  But it bombs out cause
> it wants a ")"  but it has one..  Forward() ROCKS!!
>
> Also how does it know to do this for just the first line?  It would
> seem that this will work for every line - No?
>
This works for me:

test4 = r"""fprintf( outFile "leSetInstSelectable( t )\n" )
("test"
"test1" "foo aasdfasdf"
"newline" "test2")
"""

ident = Forward()
ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore(
dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)
dataFormat = ident + ( dblQuotedString | quoteList )

print dataFormat.parseString(test4)

Prints:
[['fprintf', '(', 'outFile', '"leSetInstSelectable( t )\\n"', ')'],
['"test"', '"test1"', '"foo aasdfasdf"', '"newline"', '"test2"']]


1. Is there supposed to be a real line break in the string
"leSetInstSelectable( t )\n", or just a slash-n at the end?  pyparsing
quoted strings do not accept multiline quotes, but they do accept escaped
characters such as "\t" "\n", etc.  That is, to pyparsing:

"\n this is a valid \t \n string"

"this is not
a valid string"

Part of the confusion is that your examples include explicit \r\n
characters.  I'm assuming this is to reflect what you see when listing out
the Python variable containing the string.  (Are you opening a text file
with "rb" to read in binary?  Try opening with just "r", and this may
resolve your \r\n problems.)

2. If restOfLine is still giving you \r's at the end, you can redefine
restOfLine to not include them, or to include and suppress them.  Or (this
is easier) define a parse action for restOfLine that strips trailing \r's:

def stripTrailingCRs(st,loc,toks):
    try:
        if toks[0][-1] == '\r':
            return toks[0][:-1]
    except IndexError:
        # empty match: nothing to strip
        pass

restOfLine.setParseAction( stripTrailingCRs )


3.  How does it know to only do it for the first line?  Presumably you told
it to do so.  pyparsing's parseString method starts at the beginning of the
input string, and matches expressions until it finds a mismatch, or runs out
of expressions to match - even if there is more input string to process,
pyparsing does not continue.  To search through the whole file looking for
idents, try using scanString which returns a generator; for each match, the
generator gives a tuple containing:
- tokens - the matched tokens
- start - the start location of the match
- end - the end location of the match

If your input file consists *only* of these constructs, you can also just
expand dataFormat.parseString to OneOrMore(dataFormat).parseString.
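
Points 1 and 2 both come down to stray carriage returns in the captured output. Independent of pyparsing, one way to sidestep them is to normalize line endings before parsing at all. The helper name `normalize_newlines` below is made up for this sketch.

```python
def normalize_newlines(text):
    # Turn Windows (CR LF) and lone CR line endings into plain LF,
    # so later parsing never sees a trailing carriage return.
    return text.replace("\r\n", "\n").replace("\r", "\n")


data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'
clean = normalize_newlines(data)

# The first line is now free of any trailing carriage return.
first_line = clean.split("\n", 1)[0]
print(repr(first_line))
```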


-- Paul




Re: Regex help needed

2006-01-10 Thread rh0dium

Michael Spencer wrote:
>   >>> def parse(source):
>   ...     source = source.splitlines()
>   ...     original, rest = source[0], "\n".join(source[1:])
>   ...     return original, rest_eval(get_tokens(rest))

This is a very clean and elegant way to separate them - Very nice!!  I
like this a lot - I will definitely use this in the future!!

> 
> Cheers
> 
> Michael



Re: Regex help needed

2006-01-10 Thread rh0dium

Paul McGuire wrote:

> ident = Combine( Word(alpha,alphanums+"_") + LPAR + RPAR )

This will only work for a word with a parentheses ( ie.  somefunction()
)

> If you *really* want everything on the first line to be the ident, try this:
>
> ident = Word(alpha,alphanums+"_") + restOfLine
> or
> ident = Combine( Word(alpha,alphanums+"_") + restOfLine )

This nicely grabs the "\r"..  How can I get around it?

> Now the next step is to assign field names to the results:
>
> dataFormat = ident.setResultsName("ident") + ( dblQuotedString |
> quoteList ).setResultsName("contents")

This is super cool!!

So let's take this for example

test= 'fprintf( outFile "leSetInstSelectable( t )\n" )\r\n ("test"
"test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'

Now I want the ident to pull out 'fprintf( outFile
"leSetInstSelectable( t )\n" )' so I tried to do this?

ident = Forward()
ident << Group( Word(alphas,alphanums) + LPAR + ZeroOrMore(
dblQuotedString | ident | Word(alphas,alphanums) ) + RPAR)

Borrowing from the example listed previously.  But it bombs out cause
it wants a ")"  but it has one..  Forward() ROCKS!!

Also how does it know to do this for just the first line?  It would
seem that this will work for every line - No?



Re: Regex help needed

2006-01-10 Thread Michael Spencer
rh0dium wrote:
> Hi all,
> 
> I am using python to drive another tool using pexpect.  The values
> which I get back I would like to automatically put into a list if there
> is more than one return value. They provide me a way to see that the
> data is in set by parenthesising it.
> 
...

> 
> CAN SOMEONE PLEASE CLEAN THIS UP?
> 

How about using the Python tokenizer rather than re:

  >>> import cStringIO, tokenize
  ...
  >>> def get_tokens(source):
  ...     allowed_tokens = (tokenize.STRING, tokenize.OP)
  ...     src = cStringIO.StringIO(source).readline
  ...     src = tokenize.generate_tokens(src)
  ...     return (token[1] for token in src if token[0] in allowed_tokens)
  ...
  >>> def rest_eval(tokens):
  ...     output = []
  ...     for token in tokens:
  ...         if token == "(":
  ...             output.append(rest_eval(tokens))
  ...         elif token == ")":
  ...             return output
  ...         else:
  ...             output.append(token[1:-1])
  ...     return output
  ...
  >>> def parse(source):
  ...     source = source.splitlines()
  ...     original, rest = source[0], "\n".join(source[1:])
  ...     return original, rest_eval(get_tokens(rest))
  ...
  >>> sources = [
  ...     'someFunction\r\n "test" "foo"\r\n',
  ...     'someFunction\r\n "test  foo"\r\n',
  ...     'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n',
  ...     'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n']
  >>>
  >>> for data in sources: parse(data)
  ...
  ('someFunction', ['test', 'foo'])
  ('someFunction', ['test  foo'])
  ('getVersion()', ['@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $'])
  ('someFunction', [['test', 'test1', 'foo aasdfasdf', 'newline', 'test2']])
  >>>
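
For readers on Python 3, where cStringIO no longer exists, the same tokenizer idea might look like the sketch below. It assumes io.StringIO as the replacement and uses the named fields of the token tuples. The sample input is modeled on the data shapes in this thread rather than copied from them.

```python
import io
import tokenize


def get_tokens(source):
    # Keep only string literals and operators (the parentheses).
    allowed = (tokenize.STRING, tokenize.OP)
    readline = io.StringIO(source).readline
    return (tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type in allowed)


def rest_eval(tokens):
    # Build nested lists: "(" opens a sublist, ")" closes the current one.
    output = []
    for token in tokens:
        if token == "(":
            output.append(rest_eval(tokens))
        elif token == ")":
            return output
        else:
            output.append(token[1:-1])  # drop the surrounding quotes
    return output


def parse(source):
    lines = source.splitlines()
    return lines[0], rest_eval(get_tokens("\n".join(lines[1:])))


result = parse('someFunction()\r\n("test" "test1"\r\n"newline")\r\n')
print(result)
```

Because rest_eval pulls from one shared generator, the recursive call for a nested "(" naturally continues where the outer loop left off.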

Cheers

Michael



Re: Regex help needed

2006-01-10 Thread Paul McGuire
"rh0dium" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
>
> Paul McGuire wrote:
> > -- Paul
> > (Download pyparsing at http://pyparsing.sourceforge.net.)
>
> Done.
>
>
> Hey this is pretty cool!  I have one small problem that I don't know
> how to resolve.  I want the entire contents (whatever it is) of line 1
> to be the ident.  Now digging into the code showed a method line,
> lineno and LineStart LineEnd.  I tried to use all three but it didn't
> work for a few reasons ( line = type issues, lineno - I needed the data
> and could't get it to work, LineStart/End - I think it matches every
> line and I need the scope to line 1 )
>
> So here is my rendition of the code - But this is REALLY slick..
>
> I think the problem is the parens on line one
>
> def main(data=None):
>
> LPAR = Literal("(")
> RPAR = Literal(")")
>
> # assume function identifiers must start with alphas, followed by
> zero or more
> # alphas, numbers, or '_' - expand this defn as needed
> ident = LineStart + LineEnd
>
> # define a list as one or more quoted strings, inside ()'s - we'll
> tackle nesting
> # in a minute
> quoteList = Group( LPAR.suppress() + OneOrMore(dblQuotedString) +
> RPAR.suppress())
>
> # define format of a line of data - don't bother with \n's or \r's,
>
> # pyparsing just skips 'em
> dataFormat = ident + ( dblQuotedString | quoteList )
>
> return dataFormat.parseString(data)
>
>
> # General run..
> if __name__ == '__main__':
>
>
> # data = 'someFunction\r\n "test" "foo"\r\n'
> # data = 'someFunction\r\n "test  foo"\r\n'
> data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0
> 05/22/2005 23:36 (cicln01) $"\r\n'
> # data = 'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n
> "newline" "test2")\r\n'
>
> foo = main(data)
>
> print foo
>

LineStart() + LineEnd() will only match an empty line.


If you describe in words what you want ident to be, it may be more natural
to translate to pyparsing.

"A word starting with an alpha, followed by zero or more alphas, numbers, or
'_'s, with a trailing pair of parens"

ident = Word(alphas,alphanums+"_") + LPAR + RPAR


If you want the ident all combined into a single token, use:

ident = Combine( Word(alphas,alphanums+"_") + LPAR + RPAR )


LineStart and LineEnd are geared more for line-oriented or
whitespace-sensitive grammars.  Your example doesn't really need them, I
don't think.

If you *really* want everything on the first line to be the ident, try this:

ident = Word(alphas,alphanums+"_") + restOfLine
or
ident = Combine( Word(alphas,alphanums+"_") + restOfLine )


Now the next step is to assign field names to the results:

dataFormat = ident.setResultsName("ident") + ( dblQuotedString |
quoteList ).setResultsName("contents")

test = "blah blah test string"

results = dataFormat.parseString(test)
print results.ident, results.contents

I'm glad pyparsing is working out for you!  There should be a number of
examples that ship with pyparsing that may give you some more ideas on how
to proceed from here.

-- Paul




Re: Regex help needed

2006-01-10 Thread rh0dium

Paul McGuire wrote:
> -- Paul
> (Download pyparsing at http://pyparsing.sourceforge.net.)

Done.


Hey this is pretty cool!  I have one small problem that I don't know
how to resolve.  I want the entire contents (whatever it is) of line 1
to be the ident.  Now digging into the code showed a method line,
lineno and LineStart LineEnd.  I tried to use all three but it didn't
work for a few reasons ( line = type issues, lineno - I needed the data
and couldn't get it to work, LineStart/End - I think it matches every
line and I need the scope to line 1 )

So here is my rendition of the code - But this is REALLY slick..

I think the problem is the parens on line one

def main(data=None):

    LPAR = Literal("(")
    RPAR = Literal(")")

    # assume function identifiers must start with alphas, followed by zero or more
    # alphas, numbers, or '_' - expand this defn as needed
    ident = LineStart + LineEnd

    # define a list as one or more quoted strings, inside ()'s - we'll tackle nesting
    # in a minute
    quoteList = Group( LPAR.suppress() + OneOrMore(dblQuotedString) +
                       RPAR.suppress())

    # define format of a line of data - don't bother with \n's or \r's,
    # pyparsing just skips 'em
    dataFormat = ident + ( dblQuotedString | quoteList )

    return dataFormat.parseString(data)


# General run..
if __name__ == '__main__':

    # data = 'someFunction\r\n "test" "foo"\r\n'
    # data = 'someFunction\r\n "test  foo"\r\n'
    data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'
    # data = 'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'

    foo = main(data)

    print foo



Re: Regex help needed

2006-01-10 Thread Paul McGuire
"rh0dium" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Hi all,
>
> I am using python to drive another tool using pexpect.  The values
> which I get back I would like to automatically put into a list if there
> is more than one return value. They provide me a way to see that the
> data is in set by parenthesising it.
>


Well, you asked for regex help, but a pyparsing rendition may be easier to
read and maintain.

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)


# test data strings
test1 = """somefunction()
"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"
"""

test2 = """somefunction()
("." "~"
"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"
"foo")
"""

test3 = """somefunctionWithNestedlist()
("." "~"
"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"
("Hey!"
"this is a nested"
"list")
"foo")
"""

"""
So if you're still reading this I want to parse out data.  Here are the
rules...
- Line 1 ALWAYS is the calling function whatever is there (except
"\r\n") should be kept as "original"
- Anything may occur inside the quotations - I don't care what's in
there per se but it must be maintained.
- Parenthesed items I want to be pushed into a list.  I haven't run
into a case where you have nested paren's but that not to say it won't
happen...
"""

from pyparsing import Literal, Word, alphas, alphanums, \
dblQuotedString, OneOrMore, Group, Forward

LPAR = Literal("(")
RPAR = Literal(")")

# assume function identifiers must start with alphas, followed by zero or more
# alphas, numbers, or '_' - expand this defn as needed
ident = Word(alphas,alphanums+"_")

# define a list as one or more quoted strings, inside ()'s - we'll tackle nesting
# in a minute
quoteList = Group( LPAR.suppress() +
   OneOrMore(dblQuotedString) +
   RPAR.suppress() )

# define format of a line of data - don't bother with \n's or \r's,
# pyparsing just skips 'em
dataFormat = ident + LPAR + RPAR + ( dblQuotedString | quoteList )

def test(t):
    print dataFormat.parseString(t)

print "Parse flat lists"
test(test1)
test(test2)

# modifications for nested lists
quoteList = Forward()
quoteList << Group( LPAR.suppress() +
   OneOrMore(dblQuotedString | quoteList) +
   RPAR.suppress() )
dataFormat = ident + LPAR + RPAR + ( dblQuotedString | quoteList )

print
print "Parse using nested lists"
test(test1)
test(test2)
test(test3)

Parsing results:
Parse flat lists
['somefunction', '(', ')', '"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005
23:36 (cicln01) $"']
['somefunction', '(', ')', ['"."', '"~"',
'"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"', '"foo"']]

Parse using nested lists
['somefunction', '(', ')', '"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005
23:36 (cicln01) $"']
['somefunction', '(', ')', ['"."', '"~"',
'"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"', '"foo"']]
['somefunctionWithNestedlist', '(', ')', ['"."', '"~"',
'"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"', ['"Hey!"',
'"this is a nested"', '"list"'], '"foo"']]





Regex help needed

2006-01-10 Thread rh0dium
Hi all,

I am using python to drive another tool using pexpect.  The values
which I get back I would like to automatically put into a list if there
is more than one return value. They provide me a way to see that the
data is a set by parenthesizing it.

This is all generated as I said using pexpect - Here is how I use it..
 child = pexpect.spawn( _buildCadenceExe(), timeout=timeout)
 child.sendline("somefunction()")
 child.expect("> ")
 data=child.before

Given this data can take on several shapes:

Single return value -- THIS IS THE ONE I CAN'T GET TO WORK..
data = 'somefunction()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'

Multiple return value
data = 'somefunction()\r\n("." "~" "/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile")\r\n'

It may take up several lines...
data = 'somefunction()\r\n("." "~" \r\n"/eda/ic_5.10.41.500.1.18/tools.lnx86/dfII/samples/techfile"\r\n"foo")\r\n'

So if you're still reading this I want to parse out data.  Here are the
rules...
- Line 1 ALWAYS is the calling function whatever is there (except
"\r\n") should be kept as "original"
- Anything may occur inside the quotations - I don't care what's in
there per se but it must be maintained.
- Parenthesized items I want to be pushed into a list.  I haven't run
into a case where you have nested parens, but that's not to say it won't
happen...

So here is my code..  Pardon my hack job..

import os,re

def main(data=None):

    # Get rid of the annoying \r's
    dat=data.split("\r")
    data="".join(dat)

    # Remove the first line - that is the original call
    dat = data.split("\n")
    original=dat[0]
    del dat[0]

    print "Original", original
    # Now join all of the remaining lines
    retl="".join(dat)

    # self.logger.debug("Original = \'%s\'" % original)

    try:
        # Get rid of the parenthesis
        parmatcher = re.compile( r'\(([^()]*)\)' )
        parmatch = parmatcher.search(retl)

        # Get rid of the first and last quotes
        qrmatcher = re.compile( r'\"([^()]*)\"' )
        qrmatch = qrmatcher.search(parmatch.group(1))

        # Split the items
        qmatch=re.compile(r'\"\s+\"')
        results = qmatch.split(qrmatch.group(1))
    except:
        qrmatcher = re.compile( r'\"([^()]*)\"' )
        qrmatch = qrmatcher.search(retl)

        # Split the items
        qmatch=re.compile(r'\"\s+\"')
        results = qmatch.split(qrmatch.group(1))

    print "Orig", original, "Results", results
    return original,results


# General run..
if __name__ == '__main__':

    # data = 'someFunction\r\n "test" "foo"\r\n'
    # data = 'someFunction\r\n "test  foo"\r\n'
    data = 'getVersion()\r\n"@(#)$CDS: icfb.exe version 5.1.0 05/22/2005 23:36 (cicln01) $"\r\n'
    # data = 'someFunction\r\n ("test" "test1" "foo aasdfasdf"\r\n "newline" "test2")\r\n'

    main(data)

CAN SOMEONE PLEASE CLEAN THIS UP?
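
One possible tidy-up of the rules listed above, as a sketch rather than a definitive answer. It stays with the standard library's re module, does not handle nested parentheses, and the names `QUOTED` and `parse_reply` are invented for this example.

```python
import re

# A double-quoted string, allowing backslash escapes inside it.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def parse_reply(data):
    """Return (original, result) for one chunk of tool output.

    result is a list when the reply was parenthesized, a single string
    when it was one quoted value, and None when nothing quoted was found.
    """
    lines = data.splitlines()              # also drops the line endings
    original = lines[0]                    # line 1 is always the call itself
    rest = "\n".join(lines[1:]).strip()
    values = QUOTED.findall(rest)          # quoted contents, kept verbatim
    if rest.startswith("("):
        return original, values
    return original, (values[0] if values else None)


single = parse_reply('somefunction()\r\n"@(#)$CDS: icfb.exe version 5.1.0 '
                     '05/22/2005 23:36 (cicln01) $"\r\n')
multi = parse_reply('somefunction()\r\n("." "~"\r\n"foo")\r\n')
print(single)
print(multi)
```

The parentheses inside the quoted version string are left alone because the quoted-string pattern is applied to the whole reply, not to whatever sits between the first pair of parentheses.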
