Re: A regular expression question

2016-09-28 Thread Ben Finney
Cpcp Cp  writes:

> Look at this
>
> >>> import re
> >>> text="asdfnbd]"
> >>> m=re.sub("n*?","?",text)
> >>> print m
> ?a?s?d?f?n?b?d?]?
>
> I don't understand the 'non-greedy' pattern.

Since ‘n*?’ matches zero or more ‘n’s, it can match the empty string, and
there is an empty string adjacent to every actual character.

It's non-greedy because it matches as few characters as will allow the
match to succeed, and zero characters is always enough here.
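
A minimal sketch of the same point, written for Python 3 (the variable
names are illustrative). It uses re.match so that each result depends only
on the quantifier: a lazy quantifier settles for the empty match when that
is enough, and grows only when something after it forces it to. The
transcript above is from Python 2; newer Python 3 releases treat an empty
match next to a non-empty one differently in re.sub, so the sketch avoids
relying on that output.

```python
import re

text = "asdfnbd]"

# Lazy: zero 'n's already satisfies the pattern, so the match is empty.
lazy_alone = re.match(r"n*?", "nnn").group()

# Greedy: takes as many 'n's as it can.
greedy_alone = re.match(r"n*", "nnn").group()

# Lazy but followed by 'b': it has to grow until 'b' can match.
lazy_forced = re.match(r"n*?b", "nnnb").group()

# Replace each run of one or more 'n's.
runs_replaced = re.sub(r"n+", "?", text)

# Replace every character, which is what the question expected.
all_replaced = re.sub(r".", "?", text)
```

To replace every character, the pattern has to consume a character on each
match, which ‘.’ does and ‘n*?’ does not.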

> I think the repl argument should replace every char in text and
> output "".

I hope that helps you understand why that expectation is wrong :-)

Regular expression patterns are *not* an easy topic. Try experimenting
and learning with <http://www.regexr.com/>.

-- 
 \  “If I haven't seen as far as others, it is because giants were |
  `\   standing on my shoulders.” —Hal Abelson |
_o__)  |
Ben Finney

-- 
https://mail.python.org/mailman/listinfo/python-list


A regular expression question

2016-09-28 Thread Cpcp Cp
Look at this

>>> import re
>>> text="asdfnbd]"
>>> m=re.sub("n*?","?",text)
>>> print m
?a?s?d?f?n?b?d?]?

I don't understand the 'non-greedy' pattern.

I think the repl argument should replace every char in text and output
"".



Re: regular expression question (re module)

2008-10-16 Thread Philip Semanchuk


On Oct 16, 2008, at 11:25 PM, Steve Holden wrote:

> Pat wrote:
>> Faheem Mitha wrote:
>>> Hi,
>>>
>>> I need to match a string of the form
>>>
>>> capital_letter underscore capital_letter number
>>>
>>> against a string of the form
>>>
>>> anything capital_letter underscore capital_letter number
>>> some_stuff_not_starting with a number
>>>
>>> DUKE1_plateD_A12.CEL.
>>>
>>> Thanks in advance. Please cc me with any reply.
>>> Faheem.
>>
>> While I can't provide you with an answer, I can say that I've been using
>> RegExBuddy (for Windows, about $40, 90 day money back guarantee,
>> http://www.regexbuddy.com/) for quite a few months now and it's greatly
>> helped me with creating/learning/debugging regexps.  You put in your
>> regexp in the top field and all the possibilities in the bottom field.
>> Whatever matches is instantly highlighted.  You keep modifying your RE
>> until only the correct matches are highlighted. Talk about instant
>> gratification!  No, I'm in no way affiliated with this company.
>>
>> There's also a free *IX version that's quite similar to RegExBuddy but I
>> don't have the name since I'm writing this while on a Windows platform.
>
> Or you could use the Kodos tool, written in Python and well worth a
> trial since it's free. Google is, as always, your friend in locating it.

I use this one as my regex playground:
http://cthedot.de/retest/


Re: regular expression question (re module)

2008-10-16 Thread Steve Holden
Pat wrote:
> Faheem Mitha wrote:
>> Hi,
>>
>> I need to match a string of the form
>>
>> capital_letter underscore capital_letter number
>>
>> against a string of the form
>>
>> anything capital_letter underscore capital_letter number
>> some_stuff_not_starting with a number
>>
> 
>> DUKE1_plateD_A12.CEL.
>>
>> Thanks in advance. Please cc me with any reply.
>> Faheem.
>> 
> 
> While I can't provide you with an answer, I can say that I've been using
> RegExBuddy (for Windows, about $40, 90 day money back guarantee,
> http://www.regexbuddy.com/) for quite a few months now and it's greatly
> helped me with creating/learning/debugging regexps.  You put in your
> regexp in the top field and all the possibilities in the bottom field.
>  Whatever matches is instantly highlighted.  You keep modifying your RE
> until only the correct matches are highlighted. Talk about instant
> gratification!  No, I'm in no way affiliated with this company.
> 
> There's also a free *IX version that's quite similar to RegExBuddy but I
> don't have the name since I'm writing this while on a Windows platform.
> 
Or you could use the Kodos tool, written in Python and well worth a
trial since it's free. Google is, as always, your friend in locating it.

regards
 Steve
-- 
Steve Holden+1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/



Re: regular expression question (re module)

2008-10-16 Thread Pat

Faheem Mitha wrote:

Hi,

I need to match a string of the form

capital_letter underscore capital_letter number

against a string of the form

anything capital_letter underscore capital_letter number
some_stuff_not_starting with a number




DUKE1_plateD_A12.CEL.

Thanks in advance. Please cc me with any reply. 
Faheem.



While I can't provide you with an answer, I can say that I've been using 
RegExBuddy (for Windows, about $40, 90 day money back guarantee, 
http://www.regexbuddy.com/) for quite a few months now and it's greatly 
helped me with creating/learning/debugging regexps.  You put in your 
regexp in the top field and all the possibilities in the bottom field. 
 Whatever matches is instantly highlighted.  You keep modifying your RE 
until only the correct matches are highlighted. Talk about instant 
gratification!  No, I'm in no way affiliated with this company.


There's also a free *IX version that's quite similar to RegExBuddy but I 
don't have the name since I'm writing this while on a Windows platform.



Re: regular expression question (re module)

2008-10-11 Thread bearophileHUGS
Faheem Mitha:
> I need to match a string of the form
> ...

Please, show the code you have written so far, with your input-output
examples included (as doctests, for example), and we can try to find
ways to help you remove the bugs you have.

Bye,
bearophile


regular expression question (re module)

2008-10-11 Thread Faheem Mitha
Hi,

I need to match a string of the form

capital_letter underscore capital_letter number

against a string of the form

anything capital_letter underscore capital_letter number
some_stuff_not_starting with a number

Eg D_A1 needs to match with DUKE1_plateD_A1.CEL, but not any of
DUKE1_plateD_A10.CEL, Duke1_PlateD_A11v2.CEL,
DUKE1_plateD_A12.CEL.

Similarly D_A10 needs to match DUKE1_plateD_A10.CEL, but not any
of DUKE1_plateD_A1.CEL, Duke1_PlateD_A11v2.CEL,
DUKE1_plateD_A12.CEL.

Similarly D_A11 needs to match Duke1_PlateD_A11v2.CEL, but not any
of DUKE1_plateD_A1.CEL, DUKE1_plateD_A10.CEL,
DUKE1_plateD_A12.CEL.

Thanks in advance. Please cc me with any reply. 
Faheem.
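
One way to express this requirement, as a sketch for Python 3 rather than
an answer taken from the thread: escape the key and add a negative
lookahead so that the key may not be followed by another digit. The helper
name key_matches is hypothetical; the file names are the ones from the
examples above.

```python
import re

def key_matches(key, filename):
    """Return True if key occurs in filename and is not followed by a digit."""
    pattern = re.escape(key) + r"(?!\d)"
    return re.search(pattern, filename) is not None

names = [
    "DUKE1_plateD_A1.CEL",
    "DUKE1_plateD_A10.CEL",
    "Duke1_PlateD_A11v2.CEL",
    "DUKE1_plateD_A12.CEL",
]

matches_a1 = [n for n in names if key_matches("D_A1", n)]
matches_a10 = [n for n in names if key_matches("D_A10", n)]
matches_a11 = [n for n in names if key_matches("D_A11", n)]
```

The lookahead checks the character after the key without consuming it, so
"D_A1" is rejected inside "D_A10" but accepted before ".CEL" or "v2".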



Re: Regular Expression question

2007-10-25 Thread looping
On Oct 25, 9:25 am, Peter Otten <[EMAIL PROTECTED]> wrote:
>
> You want a "negative lookahead assertion" then:
>

Now I feel dumb...
I've seen (?!...) a dozen times in the docs but never figured out that
it was what I was looking for.

So this one is the winner:
s = re.search(r'create\s+or\s+replace\s+package\s+(?!body\s+)', txt,
re.IGNORECASE)

Thanks Peter and Marc.
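
A caveat worth checking, sketched for Python 3: when the lookahead sits
after \s+, the \s+ can give back whitespace until the lookahead succeeds,
so a line with two spaces before "body" still matches. Moving the
lookahead in front of the whitespace avoids that. The names after_ws,
before_ws and matching_lines are illustrative, and the sample text is the
one from the question with an extra space added to the second statement.

```python
import re

txt = """\
create or replace package XXX
...

create or replace package  body XXX
...
"""

# Pattern from the thread: the lookahead comes after \s+.
after_ws = re.compile(
    r'create\s+or\s+replace\s+package\s+(?!body\s+)', re.IGNORECASE)

# Variant: the lookahead comes before the whitespace, so \s+ cannot
# backtrack its way around it.
before_ws = re.compile(
    r'create\s+or\s+replace\s+package(?!\s+body\b)\s+', re.IGNORECASE)

def matching_lines(pattern, text):
    """Return the 1-based numbers of lines the pattern matches at their start."""
    return [n for n, line in enumerate(text.splitlines(), 1)
            if pattern.match(line)]

lines_after = matching_lines(after_ws, txt)
lines_before = matching_lines(before_ws, txt)
```

With single spaces both patterns behave the same; they differ only when
more than one whitespace character separates "package" from "body".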



Re: Regular Expression question

2007-10-25 Thread Peter Otten
looping wrote:

> On Oct 25, 8:49 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
>>
>> needle = re.compile(r'create\s+or\s+replace\s+package(\s+body)?\s+',
>> re.IGNORECASE)
> 
> What I want here is a RE that return ONLY the line without the "body"
> keyword.
> Your RE return both.
> I know I could use it but I want to learn how to search something that
> is NOT in the string using RE.

You want a "negative lookahead assertion" then:

>>> import re
>>> s = """Isaac Newton
... Isaac Asimov
... Isaac Singer
... """
>>> re.compile("Isaac (?!Asimov).*").findall(s)
['Isaac Newton', 'Isaac Singer']

Peter


Re: Regular Expression question

2007-10-25 Thread looping
On Oct 25, 8:49 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
>
> needle = re.compile(r'create\s+or\s+replace\s+package(\s+body)?\s+',
> re.IGNORECASE)

What I want here is a RE that return ONLY the line without the "body"
keyword.
Your RE return both.
I know I could use it but I want to learn how to search something that
is NOT in the string using RE.





Re: Regular Expression question

2007-10-24 Thread Marc 'BlackJack' Rintsch
On Thu, 25 Oct 2007 06:34:03 +, looping wrote:

> Hi,
> It's not really a Python question but I'm sure someone could help me.
> 
> When I use RE, I always have trouble with this kind of search:
> 
> Ex.
> 
> I've a text file:
> """
> create or replace package XXX
> ...
> 
> create or replace package body XXX
> ...
> """
> now I want to search the position (line) of this two string.
> 
> for the body I use:
> s = re.search(r'create\s+or\s+replace\s+package\s+body\s+', txt,
> re.IGNORECASE)
> 
> but how to search for the other line ?
> I want the same RE but explicitly without "body".

Then write the same RE but explicitly without "body".  But I guess I didn't
understand your problem when the answer is that obvious.

Maybe you want to iterate over the text file line by line and match or
search within the line? Untested:

needle = re.compile(r'create\s+or\s+replace\s+package(\s+body)?\s+',
                    re.IGNORECASE)
for i, line in enumerate(lines):
    if needle.match(line):
        print 'match in line %d' % (i + 1)

Ciao,
Marc 'BlackJack' Rintsch


Regular Expression question

2007-10-24 Thread looping
Hi,
It's not really a Python question but I'm sure someone could help me.

When I use RE, I always have trouble with this kind of search:

Ex.

I've a text file:
"""
create or replace package XXX
...

create or replace package body XXX
...
"""
now I want to search the position (line) of this two string.

for the body I use:
s = re.search(r'create\s+or\s+replace\s+package\s+body\s+', txt,
re.IGNORECASE)

but how to search for the other line ?
I want the same RE but explicitly without "body".

Thanks for your help.



Re: Python regular expression question!

2006-09-20 Thread unexpected
Sweet! Thanks so much!



Re: Python regular expression question!

2006-09-20 Thread Ant

unexpected wrote:
> > \b matches the beginning/end of a word (characters a-zA-Z_0-9).
> > So that regex will match e.g. MULTX-FOO but not MULTX-.
> >
>
> So is there a way to get \b to include - ?

No, but you can get the behaviour you want using negative lookaheads.
The following regex is effectively \b where - is treated as a word
character:

pattern = r"(?![a-zA-Z0-9_-])"

This asserts that the next character is not in the class
[a-zA-Z0-9_-] (or that the string has ended) without consuming it. For
example:

>>> p = re.compile(r".*?(?![a-zA-Z0-9_-])(.*)")
>>> s = "aabbcc_d-f-.XXX YYY"
>>> m = p.search(s)
>>> print m.group(1)
.XXX YYY

Note that the regex recognises the '.' as the end of the word, but
doesn't use it up in the match, so it is present in the final capturing
group. Contrast it with:

>>> p = re.compile(r".*?[^a-zA-Z0-9_-](.*)")
>>> s = "aabbcc_d-f-.XXX YYY"
>>> m = p.search(s)
>>> print m.group(1)
XXX YYY

Note here that "[^a-zA-Z0-9_-]" still denotes the end of the word, but
this time consumes it, so it doesn't appear in the final captured group.
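
Applied to the original question, the lookahead can stand in for \b after
an escaped keyword. A small sketch for Python 3; the function name
starts_with_keyword and the WORD_CHARS constant are illustrative, not from
the thread:

```python
import re

# Characters treated as part of a word, with '-' included.
WORD_CHARS = r"[A-Za-z0-9_-]"

def starts_with_keyword(keyword, line):
    """True if line starts with keyword and no word character follows it."""
    pattern = "^" + re.escape(keyword) + "(?!" + WORD_CHARS + ")"
    return re.search(pattern, line) is not None
```

Because the lookahead also succeeds at the end of the string, a line that
consists of the keyword alone is accepted.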



Re: Python regular expression question!

2006-09-20 Thread unexpected

> \b matches the beginning/end of a word (characters a-zA-Z_0-9).
> So that regex will match e.g. MULTX-FOO but not MULTX-.
> 

So is there a way to get \b to include - ?



Re: Python regular expression question!

2006-09-20 Thread Hallvard B Furuseth
"unexpected" <[EMAIL PROTECTED]> writes:

> I'm trying to do a whole word pattern match for the term 'MULTX-'
>
> Currently, my regular expression syntax is:
>
> re.search(('^')+(keyword+'\\b')

\b matches the beginning/end of a word (characters a-zA-Z_0-9).
So that regex will match e.g. MULTX-FOO but not MULTX-.

Incidentally, in case the keyword contains regex special characters
(like '*') you may wish to escape it: re.escape(keyword).
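
A short sketch of both points for Python 3 (the variable names are
illustrative). \b needs a word character on exactly one side, and '-' is
not a word character, so the boundary after "MULTX-" exists only when a
word character follows:

```python
import re

keyword = "MULTX-"
# Escape the keyword, then require a word boundary after it.
pattern = "^" + re.escape(keyword) + r"\b"

followed_by_word = re.search(pattern, "MULTX-FOO")
followed_by_space = re.search(pattern, "MULTX- FOO")
at_end = re.search(pattern, "MULTX-")
```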

-- 
Hallvard


Python regular expression question!

2006-09-20 Thread unexpected
I'm trying to do a whole word pattern match for the term 'MULTX-'

Currently, my regular expression syntax is:

re.search(('^')+(keyword+'\\b')

where keyword comes from a list of terms. ('MULTX-' is in this list,
and hence a keyword).

My regular expression works for a variety of different keywords except
for 'MULTX-'. It does work for MULTX, however, so I'm thinking that the
'-' sign is delimited as a word boundary. Is there any way to get
Python to override this word boundary?

I've tried using raw strings, but the syntax is painful. My attempts
were:

re.search(('^')+("r"+keyword+'\b')
re.search(('^')+("r'"+keyword+'\b')

and then tried the even simpler:

re.search(('^')+("r'"+keyword)
re.search(('^')+("r''"+keyword)


and all of those failed for everything. Any suggestions?
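
The attempts above treat r as text to concatenate, but the r prefix is
part of the string literal syntax and only affects how backslashes in that
literal are read. A minimal sketch for Python 3, with illustrative names:

```python
import re

raw = r"\b"        # two characters: a backslash and 'b'
cooked = "\b"      # one character: backspace (0x08)
doubled = "\\b"    # the same two characters as the raw literal

keyword = "MULTX"
# Build the pattern from a literal prefix, the escaped keyword, and a
# raw-string suffix.
pattern = "^" + re.escape(keyword) + r"\b"

# Concatenating the letter "r" just produces the text 'rMULTX'.
wrong = "r" + keyword
```

A variable such as keyword already holds a finished string, so there is
nothing to make "raw" about it; only literals written in the source need
the prefix.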



Re: Regular Expression question

2006-08-22 Thread Anthra Norell
Steve,
   I thought Fredrik Lundh's proposal was perfect. Are you now saying it
doesn't solve your problem because your description of the problem was
incomplete? If so, could you post a worst-case piece of HTML, one that
contains all possible complications, or a collection of different cases
all of which you need to handle?

Frederic

- Original Message -
From: <[EMAIL PROTECTED]>
Newsgroups: comp.lang.python
To: 
Sent: Monday, August 21, 2006 11:35 PM
Subject: Re: Regular Expression question


> Hi, thanks everyone for the information! Still going through it :)
>
> The reason I did not match on tag2 in my original expression (and I
> apologize because I should have mentioned this before) is that other
> tags could also have an attribute with the value of "adj__" and the
> attribute name may not be the same for the other tags. The only thing I
> can be sure of is that the value will begin with "adj__".
>
> I need to match the "adj__" value with the closest preceding tag1
> irrespective of what tag the "adj__" is in, or what the attribute
> holding it is called, or the order of the attributes (there may be
> others). This data will be inside an html page and so there will be
> plenty of html tags in the middle all of which I need to ignore.
>
> Thanks very much!
> Steve
>



Re: Regular Expression question

2006-08-21 Thread stevebread
Hi, thanks everyone for the information! Still going through it :)

The reason I did not match on tag2 in my original expression (and I
apologize because I should have mentioned this before) is that other
tags could also have an attribute with the value of "adj__" and the
attribute name may not be the same for the other tags. The only thing I
can be sure of is that the value will begin with "adj__".

I need to match the "adj__" value with the closest preceding tag1
irrespective of what tag the "adj__" is in, or what the attribute
holding it is called, or the order of the attributes (there may be
others). This data will be inside an html page and so there will be
plenty of html tags in the middle all of which I need to ignore.

Thanks very much!
Steve



Re: Regular Expression question

2006-08-21 Thread Rob Wolfe

[EMAIL PROTECTED] wrote:
> got zero results on this one :)

Really?

>>> s = '''   


'''

>>> pat = re.compile('tag1.+?name="(.+?)".*?(?:<)(?=tag2).*?="adj__(.*?)__',
...                  re.DOTALL)
>>> m = re.findall(pat, s)
>>> m
[('john', 'tall'), ('joe', 'short')]


Regards,
Rob



Re: Regular Expression question

2006-08-21 Thread Paul McGuire
<[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
>
> 
> 
> 
>
> Whenever a tag1 is followed by a tag 2, I want to retrieve the values
> of the tag1:name and tag2:value attributes. So my end result here
> should be
> john, tall
> jack, short
>

A pyparsing solution may not be a speed demon to run, but doesn't take too
long to write.  Some short explanatory comments:
- makeHTMLTags returns a tuple of opening and closing tags, but this example
does not use any closing tags, so simpler to just discard them (only use
zero'th return value)
- Your example includes not only <tag1> and <tag2> tags, but also a <br>
tag, which is presumably ignorable.
- The value returned from calling the searchString generator includes named
fields for the different tag attributes, making it easy to access the name
and value tag attributes.
- The expression generated by makeHTMLTags will also handle tags with other
surprising attributes that we didn't anticipate (such as ""
or "")
- Pyparsing leaves the values as "adj__tall__" and "adj__short__", but some
simple string slicing gets us the data we want

The pyparsing home page is at http://pyparsing.wikispaces.com.

-- Paul


from pyparsing import makeHTMLTags

tag1 = makeHTMLTags("tag1")[0]
tag2 = makeHTMLTags("tag2")[0]
br = makeHTMLTags("br")[0]

# define the pattern we're looking for, in terms of tag1 and tag2
# and specify that we wish to ignore <br> tags
patt = tag1 + tag2
patt.ignore(br)

for tokens in patt.searchString(data):
    print "%s, %s" % (tokens.startTag1.name, tokens.startTag2.value[5:-2])


Prints:
john, tall
jack, short


Printing tokens.dump() gives:
['tag1', ['name', 'jack'], True, 'tag2', ['value', 'adj__short__'], True]
- empty: True
- name: jack
- startTag1: ['tag1', ['name', 'jack'], True]
  - empty: True
  - name: jack
- startTag2: ['tag2', ['value', 'adj__short__'], True]
  - empty: True
  - value: adj__short__
- value: adj__short__




Re: Regular Expression question

2006-08-21 Thread Neil Cerutti
On 2006-08-21, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
>   
>
>
>
>
> Whenever a tag1 is followed by a tag 2, I want to retrieve the
> values of the tag1:name and tag2:value attributes. So my end
> result here should be
>
> john, tall
> jack, short
>
> Ideas?

It seems to me that an html parser might be a better solution.

Here's a slapped-together example. It uses a simple state
machine.

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.state = "get name"
        self.name_attrs = None
        self.result = {}

    def handle_starttag(self, tag, attrs):
        if self.state == "get name":
            if tag == "tag1":
                self.name_attrs = attrs
                self.state = "found name"
        elif self.state == "found name":
            if tag == "tag2":
                name = None
                for attr in self.name_attrs:
                    if attr[0] == "name":
                        name = attr[1]
                adj = None
                for attr in attrs:
                    if attr[0] == "value" and attr[1][:3] == "adj":
                        adj = attr[1][5:-2]
                if name is None or adj is None:
                    print "Markup error: expected attributes missing."
                else:
                    self.result[name] = adj
                self.state = "get name"
            elif tag == "tag1":
                # A new tag1 overrides the old one
                self.name_attrs = attrs

p = MyHTMLParser()
p.feed("""
   



""")
print repr(p.result)
p.close()

There's probably a better way to search for attributes in attr
than "for attr in attrs", but I didn't think of it, and the
example I found on the net used the same idiom.  The format of
attrs seems strange. Why isn't it a dictionary?
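
On the last point: attrs arrives as a list of (name, value) pairs,
presumably because markup may repeat an attribute, but dict(attrs) is a
convenient lookup when duplicates do not matter. A compact sketch of the
same state machine for Python 3, where the module is html.parser; the
class name PairParser and the sample markup are assumptions, reconstructed
from the names and values quoted in the thread:

```python
from html.parser import HTMLParser

class PairParser(HTMLParser):
    """Pair each tag1 name with the adj__ value of the tag2 that follows it."""

    def __init__(self):
        super().__init__()
        self.pending_name = None
        self.result = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # later duplicates overwrite earlier ones
        if tag == "tag1":
            # A new tag1 overrides the old one.
            self.pending_name = attrs.get("name")
        elif tag == "tag2" and self.pending_name is not None:
            value = attrs.get("value") or ""
            if value.startswith("adj__"):
                adj = value[len("adj__"):].rstrip("_")
                self.result[self.pending_name] = adj
            self.pending_name = None

parser = PairParser()
parser.feed('<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>'
            '<tag1 name="joe"/><tag1 name="jack"/>'
            '<tag2 value="adj__short__"/>')
parser.close()
```

Self-closing tags reach handle_starttag through the default
handle_startendtag, so no extra handler is needed for this input.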

-- 
Neil Cerutti
Sermon Outline: I. Delineate your fear II. Disown your fear III.
Displace your rear --Church Bulletin Blooper


Re: Regular Expression question

2006-08-21 Thread Paddy

[EMAIL PROTECTED] wrote:
> Hi, I am having some difficulty trying to create a regular expression.

Steve,
I find this tool is great for debugging regular expressions.
  http://kodos.sourceforge.net/

Just put some sample text in one window, your trial RE in another, and
Kodos displays a wealth of information on what matches.

Try it.

- Paddy.



Re: Regular Expression question

2006-08-21 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote:
> Hi, I am having some difficulty trying to create a regular expression.
> 
> Consider:
> 
>
> 
> 
> 
> 
> Whenever a tag1 is followed by a tag 2, I want to retrieve the values
> of the tag1:name and tag2:value attributes.   So my end result here
> should be
> john, tall
> jack, short

import re

data = """
   



"""

elems = re.findall("<(tag1|tag2)\s+(\w+)=\"([^\"]*)\"/>", data)

for i in range(len(elems)-1):
    if elems[i][0] == "tag1" and elems[i+1][0] == "tag2":
        print elems[i][2], elems[i+1][2]
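
The sample markup did not survive in this archive, so here is a
self-contained sketch of the same two-step idea for Python 3. The data
string is an assumption, rebuilt from the names and values mentioned in
the thread, and adjacent_pairs is a hypothetical helper:

```python
import re

# Assumed sample input: self-closing tags with one attribute each.
data = '''
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
'''

# Step 1: collect every tag1/tag2 element in document order.
elems = re.findall(r'<(tag1|tag2)\s+(\w+)="([^"]*)"/>', data)

def adjacent_pairs(elems):
    """Step 2: keep each tag1 that is immediately followed by a tag2."""
    out = []
    for (kind_a, _, val_a), (kind_b, _, val_b) in zip(elems, elems[1:]):
        if kind_a == "tag1" and kind_b == "tag2":
            out.append((val_a, val_b[len("adj__"):].rstrip("_")))
    return out

pairs = adjacent_pairs(elems)
```

Splitting the work this way keeps the regular expression simple: it only
recognises single elements, and ordinary Python code decides which
neighbours belong together.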





Re: Regular Expression question

2006-08-21 Thread stevebread
got zero results on this one :)



Re: Regular Expression question

2006-08-21 Thread Rob Wolfe

[EMAIL PROTECTED] wrote:
> Thanks, i just tried it but I got the same result.
>
> I've been thinking about it for a few hours now and the problem with
> this approach is that the .*? before the (?=tag2) may have matched a
> tag1 and i don't know how to detect it.

Maybe like this:
'tag1.+?name="(.+?)".*?(?:<)(?=tag2).*?="adj__(.*?)__'

HTH,
Rob



Re: Regular Expression question

2006-08-21 Thread bearophileHUGS
I am not an expert in REs yet; this is my first possible solution:

import re

txt = """
   


"""

tfinder = r"""<                # The opening < of the tag to find
              \s*              # Possible space or newline
              (tag[12])        # First subgroup, the identifier, tag1 or tag2
              \s+              # There must be a space or newline or more
              (?:name|value)   # Name or value, non-grouping
              \s*              # Possible space or newline
              =                # The =
              \s*              # Possible space or newline
              "                # Opening "
              ([^"]*)          # Second subgroup, the tag string, it can't contain "
              "                # Closing " of the string
              \s*              # Possible space or newline
              /?               # One optional ending /
              \s*              # Possible space or newline
              >                # The closing > of the tag
              ?                # Greedy, match the first closing >
           """
patt = re.compile(tfinder, flags=re.I+re.X)

prec_type = ""
prec_string = ""
for mobj in patt.finditer(txt):
    curr_type, curr_string = mobj.groups()
    if curr_type == "tag2" and prec_type == "tag1":
        print prec_string, curr_string.replace("adj__", "").strip("_")
    prec_type = curr_type
    prec_string = curr_string

Bye,
bearophile



Re: Regular Expression question

2006-08-21 Thread stevebread
Thanks, i just tried it but I got the same result.

I've been thinking about it for a few hours now and the problem with
this approach is that the .*? before the (?=tag2) may have matched a
tag1 and i don't know how to detect it.

And even if I could, how would I make the search reset its start
position to the second tag1 it found?
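
One way to stop the middle part from running over another tag1 is to guard
every skipped character with a negative lookahead. This is a sketch for
Python 3, not a pattern from the thread; the sample data is an assumption
rebuilt from the names and values quoted above, and the pattern accepts
the adj__ value in any tag and any attribute:

```python
import re

# Assumed sample input.
data = '''
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/>
<tag2 value="adj__short__"/>
'''

pat = re.compile(
    r'<tag1\b[^>]*?\bname="([^"]*)"[^>]*>'  # a tag1 and its name attribute
    r'(?:(?!<tag1\b).)*?'                   # skip ahead, never onto another <tag1
    r'="adj__(.*?)__"',                     # nearest attribute value starting adj__
    re.DOTALL,
)

pairs = pat.findall(data)
```

When the guarded skip reaches a second tag1 before finding an adj__ value,
the attempt fails and the search resumes further along, so the later tag1
gets its own attempt. That is the "reset" asked about above.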



Re: Regular Expression question

2006-08-21 Thread Rob Wolfe

[EMAIL PROTECTED] wrote:
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
>
> 
> 
> 
>
> Whenever a tag1 is followed by a tag 2, I want to retrieve the values
> of the tag1:name and tag2:value attributes. So my end result here
> should be
> john, tall
> jack, short
>
> My low quality regexp
> re.compile('tag1.+?name="(.+?)".*?(?!tag1).*?="adj__(.*?)__',
> re.DOTALL)
>
> cannot handle the case where there is a tag1 that is not followed by a
> tag2. findall returns
> john, tall
> joe, short
>
> Ideas?

Have you tried this:

'tag1.+?name="(.+?)".*?(?=tag2).*?="adj__(.*?)__'

?

HTH,
Rob



Regular Expression question

2006-08-21 Thread stevebread
Hi, I am having some difficulty trying to create a regular expression.

Consider:

   




Whenever a tag1 is followed by a tag 2, I want to retrieve the values
of the tag1:name and tag2:value attributes. So my end result here
should be
john, tall
jack, short

My low quality regexp
re.compile('tag1.+?name="(.+?)".*?(?!tag1).*?="adj__(.*?)__',
re.DOTALL)

cannot handle the case where there is a tag1 that is not followed by a
tag2. findall returns
john, tall
joe, short

Ideas?

Thanks.



Re: Regular Expression question

2006-06-08 Thread Duncan Booth
Paul McGuire wrote:

>> import re
>> r=re.compile('[^"]+)"[^>]*>',re.IGNORECASE)
>> for m in r.finditer(html):
>>     print m.group('image')
>>
> 
> Ouch - this fails to match any <img> tag that has some other
> attribute, such as "height" or "width", before the "src" attribute. 
> www.yahoo.com has several such tags.

It also fails to match any image tag where the src attribute is quoted 
using single quotes, or where the src attribute is not enclosed in quotes 
at all.

Handle all of that correctly in the regex and the beautiful soup or 
pyparsing options look even more attractive. In fact, if anyone can write a 
regex which matches the source attribute in a single named group, and 
correctly handles double, single and unquoted attributes, I'll admit to 
being impressed (and probably also slightly queasy when looking at it).

Here's my best attempt at a regex that gets it right, but it still gets 
confused by other attributes if they contain spaces.

>>> ATTR = '''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?'''
>>> NOTSRC = '(?!src=)' + ATTR
>>> PAT = '''(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)'''
>>> htmlPage = ''' '''
>>> for m in r.finditer(htmlPage):
print m.group('image')


fred.jpg
freda.jpg
>>> 


Re: Regular Expression question

2006-06-07 Thread Paul McGuire
"Frank Potter" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> pyparsing is cool.
> but use only re is also OK
> # -*- coding: UTF-8 -*-
> import urllib2
> html=urllib2.urlopen(ur"http://www.yahoo.com/").read()
>
> import re
> r=re.compile('[^"]+)"[^>]*>',re.IGNORECASE)
> for m in r.finditer(html):
>     print m.group('image')
>

Ouch - this fails to match any <img> tag that has some other attribute, such
as "height" or "width", before the "src" attribute.  www.yahoo.com has
several such tags.

On the other hand, pyparsing's makeHTMLTags defines a starting tag
expression that looks for (conceptually):

< tagname ZeroOrMore(attrname '=' value) Optional('/') >

and does not assume that the first tag is "src", or anything else for that
matter.

The returned results make the tag attributes accessible as object attributes
or dictionary keys.

-- Paul




Re: Regular Expression question

2006-06-07 Thread Frank Potter
pyparsing is cool.
but use only re is also OK
# -*- coding: UTF-8 -*-
import urllib2
html=urllib2.urlopen(ur"http://www.yahoo.com/").read()

import re
r=re.compile('[^"]+)"[^>]*>',re.IGNORECASE)
for m in r.finditer(html):
    print m.group('image')

I got these results:
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/wthr.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif

On 6/8/06, Paul McGuire <[EMAIL PROTECTED]> wrote:
> <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
> > Hi,
> > I am new to python regular expression, I would like to use it to get an
> > attribute of an html element from an html file?
> >
> > for example, I was able to read the html file using this:
> >req = urllib2.Request(url=acaURL)
> > f = urllib2.urlopen(req)
> >
> > data = f.read()
> >
> > my question is how can I just get the src attribute value of an img
> > tag?
> > something like this:
> > (.*)(.*)
> >
> > I need to get the href of the image source.
> >
> > Thanks.
> >
>
> As Fredrik pointed out, re's are not the only tool out there.  Here's a
> pyparsing solution.
>
> -- Paul
>
>
> import pyparsing
> import urllib
>
> # define HTML tag format using makeHTMLTags helper
> # (we don't really care about the ending </img> tag,
> # even though makeHTMLTags returns definitions for both
> # starting and ending tag patterns)
> imgStartTag, dummy = pyparsing.makeHTMLTags("img")
>
> # get HTML source from some web site
> htmlPage = urllib.urlopen("http://www.yahoo.com")
> htmlSource = htmlPage.read()
> htmlPage.close()
>
> # scan HTML source, printing SRC attribute from each <img> tag
> for tokens,start,end in imgStartTag.scanString(htmlSource):
>     print tokens.src
>
>
> Prints:
>
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/hea_0411.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/img_0607.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/news/video.gif
> http://us.i1.yimg.com/us.yimg.com/i/buzz/2006/06/wholefoodssmall.jpg
> http://us.i1.yimg.com/us.yimg.com/i/mntl/msg/06q2/img_im.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif
>
>
>


Re: Regular Expression question

2006-06-07 Thread Paul McGuire
<[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Hi,
> I am new to python regular expression, I would like to use it to get an
> attribute of an html element from an html file?
>
> for example, I was able to read the html file using this:
>req = urllib2.Request(url=acaURL)
> f = urllib2.urlopen(req)
>
> data = f.read()
>
> my question is how can I just get the src attribute value of an img
> tag?
> something like this:
> (.*)(.*)
>
> I need to get the href of the image source.
>
> Thanks.
>

As Fredrik pointed out, re's are not the only tool out there.  Here's a
pyparsing solution.

-- Paul


import pyparsing
import urllib

# define HTML tag format using makeHTMLTags helper
# (we don't really care about the ending </img> tag,
# even though makeHTMLTags returns definitions for both
# starting and ending tag patterns)
imgStartTag, dummy = pyparsing.makeHTMLTags("img")

# get HTML source from some web site
htmlPage = urllib.urlopen("http://www.yahoo.com")
htmlSource = htmlPage.read()
htmlPage.close()

# scan HTML source, printing SRC attribute from each <img> tag
for tokens,start,end in imgStartTag.scanString(htmlSource):
    print tokens.src


Prints:

http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/hea_0411.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/img_0607.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/beta/news/video.gif
http://us.i1.yimg.com/us.yimg.com/i/buzz/2006/06/wholefoodssmall.jpg
http://us.i1.yimg.com/us.yimg.com/i/mntl/msg/06q2/img_im.jpg
http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Re: Regular Expression question

2006-06-07 Thread 李政
I'm sorry! I mean the pattern is an argument of the function. In this case,
how do I process special characters?

pattern = 'www.'   # not this
if re.compile(pattern).match(string) is not None:
    ..

but not:

if re.compile(r'www.').match(string) is not None:

or

if re.compile('www\.').match(string) is not None:

How do you process special characters, like the dot?

Fredrik Lundh <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > I am new to python regular expression, I would like to use it to get an
> > attribute of an html element from an html file?
>
> if you want to parse HTML, use an HTML parser.  if you want to parse
> sloppy HTML, use a tolerant HTML parser:
>
>  http://www.crummy.com/software/BeautifulSoup/
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regular Expression question

2006-06-07 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote:

> I am new to python regular expression, I would like to use it to get an
> attribute of an html element from an html file?

if you want to parse HTML, use an HTML parser.  if you want to parse 
sloppy HTML, use a tolerant HTML parser:

 http://www.crummy.com/software/BeautifulSoup/
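
For reasonably well-formed pages, the same idea can be sketched with the
standard library alone. This is only an illustration in Python 3 syntax: it
uses html.parser rather than BeautifulSoup, and the class name
ImgSrcCollector and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Made-up sample markup; in practice this would be the downloaded page text.
sample = "<p><img src='a.gif' alt='A'><IMG SRC='b.jpg'><a href='x'>x</a></p>"

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)
```

The parser only reports tags; deciding which attribute to keep happens in
handle_starttag, so no regular expression is needed.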



-- 
http://mail.python.org/mailman/listinfo/python-list


Regular Expression question

2006-06-07 Thread ken . carlino
Hi,
I am new to python regular expression, I would like to use it to get an
attribute of an html element from an html file?

for example, I was able to read the html file using this:
   req = urllib2.Request(url=acaURL)
f = urllib2.urlopen(req)

data = f.read()

my question is how can I just get the src attribute value of an img
tag?
something like this:
(.*)(.*)

I need to get the href of the image source.

Thanks.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular Expression question

2005-12-01 Thread Fredrik Lundh
Michelle McCall wrote:

>I have a script that needs to scan every line of a file for numerous
> strings.  There are groups of strings for each "area" of data we are looking
> for.  Looping through each of these list of strings separately for each line
> has slowed execution to a crawl.  Can I create ONE regular expression from a
> group of strings such that when I perform a search on a line from the file
> with this RE it will search the line for each one of the strings in the RE ?

does

m = re.search("spam|egg|bacon", line)

do what you want?

if you need all matches, you can use

for m in re.finditer("spam|egg|bacon", line):
    ...

if the strings are all literal strings (i.e. no subpatterns), a little
preparation might speed things up:

words = ["spam", "spim", "spum", "spamwall", "wallspam"]
words.sort() # lexical order
words.reverse() # look for longest match first
pattern = "|".join(map(re.escape, words))
pattern = re.compile(pattern)

for m in pattern.finditer(line):
    ...
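
To see why the ordering matters, here is a small self-contained sketch in
Python 3 syntax. The sample line is made up, and sorted(..., reverse=True)
stands in for the sort() plus reverse() pair above.

```python
import re

words = ["spam", "spim", "spum", "spamwall", "wallspam"]

# Reverse lexical order puts "spamwall" ahead of its own prefix "spam",
# so the alternation tries the longer word first.
ordered = sorted(words, reverse=True)
pattern = re.compile("|".join(map(re.escape, ordered)))

line = "a spamwall next to a wallspam and some spim"  # made-up input
found = [m.group(0) for m in pattern.finditer(line)]
print(found)

# With the short word first, the longer alternative is never reached.
print(re.findall("spam|spamwall", "spamwall"))
```

Alternation in the re module takes the first alternative that matches at a
given position, not the longest one, which is why the prefix has to come
later in the pattern.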

 



-- 
http://mail.python.org/mailman/listinfo/python-list


Regular Expression question

2005-12-01 Thread Michelle McCall
I have a script that needs to scan every line of a file for numerous
strings.  There are groups of strings for each "area" of data we are looking
for.  Looping through each of these list of strings separately for each line
has slowed execution to a crawl.  Can I create ONE regular expression from a
group of strings such that when I perform a search on a line from the file
with this RE it will search the line for each one of the strings in the RE ?

Michelle
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Regular expression question -- exclude substring

2005-11-07 Thread Bengt Richter
On Mon, 7 Nov 2005 16:38:11 -0800, James Stroud <[EMAIL PROTECTED]> wrote:

>On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
>> Ya, for some reason your non-greedy "?" doesn't seem to be taking.
>> This works:
>>
>> re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
>
>The non-greedy is actually acting as expected. This is because non-greedy 
>operators are "forward looking", not "backward looking". So the non-greedy 
>finds the start of the first start-of-the-match it comes accross and then 
>finds the first occurrence of '01' that makes the complete match, otherwise 
>the greedy operator would match .* as much as it could, gobbling up all '01's 
>before the last because these match '.*'. For example:
>
>py> rgx = re.compile(r"(00.*01) target_mark")
>py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
>['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
>py> rgx = re.compile(r"(00.*?01) target_mark")
>py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
>['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
>
>My understanding is that backward looking operators are very resource 
>expensive to implement.
>
If the delimiting strings are fixed, we can use plain python string methods, 
e.g.,
(not tested beyond what you see ;-)

 >>> s = "00 noise1 01 noise2 00 target 01 target_mark"

 >>> def findit(s, beg='00', end='01', tmk=' target_mark'):
 ...     start = 0
 ...     while True:
 ...         t = s.find(tmk, start)
 ...         if t<0: break
 ...         start = s.rfind(beg, start, t)
 ...         if start<0: break
 ...         e = s.find(end, start, t)
 ...         if e+len(end)==t: # _just_ after
 ...             yield s[start:e+len(end)]
 ...         start = t+len(tmk)
 ...
 >>> list(findit(s))
 ['00 target 01']
 >>> s2 = s + ' garbage noise3 00 almost 01  target_mark 00 success 01 target_mark'
 >>> list(findit(s2))
 ['00 target 01', '00 success 01']

(I didn't enforce exact adjacency the first time, obviously it would be more 
efficient
to search for end+tmk instead of tmk and back to beg and forward to end ;-)

If there can be spurious target_marks, and tricky matching spans, additional 
logic may be needed.
Too lazy to think about it ;-)

Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression question -- exclude substring

2005-11-07 Thread James Stroud
On Monday 07 November 2005 17:31, Kent Johnson wrote:
> James Stroud wrote:
> > On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
> >>Ya, for some reason your non-greedy "?" doesn't seem to be taking.
> >>This works:
> >>
> >>re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
> >
> > The non-greedy is actually acting as expected. This is because non-greedy
> > operators are "forward looking", not "backward looking". So the
> > non-greedy finds the start of the first start-of-the-match it comes
> > accross and then finds the first occurrence of '01' that makes the
> > complete match, otherwise the greedy operator would match .* as much as
> > it could, gobbling up all '01's before the last because these match '.*'.
> > For example:
> >
> > py> rgx = re.compile(r"(00.*01) target_mark")
> > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> > ['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
> > py> rgx = re.compile(r"(00.*?01) target_mark")
> > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> > ['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
>
> ??? not in my Python:
>  >>> rgx = re.compile(r"(00.*01) target_mark")
>  >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01']
>  >>> rgx = re.compile(r"(00.*?01) target_mark")
>  >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01']
>
> Since target_mark only occurs once in the string the greedy and non-greedy
> match is the same in this case.

Somehow my cutting and pasting got messed up. It should be:

py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']

Sorry about that.

James

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression question -- exclude substring

2005-11-07 Thread Kent Johnson
James Stroud wrote:
> On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
> 
>>Ya, for some reason your non-greedy "?" doesn't seem to be taking.
>>This works:
>>
>>re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
> 
> 
> The non-greedy is actually acting as expected. This is because non-greedy 
> operators are "forward looking", not "backward looking". So the non-greedy 
> finds the start of the first start-of-the-match it comes accross and then 
> finds the first occurrence of '01' that makes the complete match, otherwise 
> the greedy operator would match .* as much as it could, gobbling up all '01's 
> before the last because these match '.*'. For example:
> 
> py> rgx = re.compile(r"(00.*01) target_mark")
> py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
> py> rgx = re.compile(r"(00.*?01) target_mark")
> py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

??? not in my Python:
 >>> rgx = re.compile(r"(00.*01) target_mark")
 >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']
 >>> rgx = re.compile(r"(00.*?01) target_mark")
 >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string the greedy and non-greedy 
match is the same in this case.

Kent
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression question -- exclude substring

2005-11-07 Thread James Stroud
On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
> Ya, for some reason your non-greedy "?" doesn't seem to be taking.
> This works:
>
> re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)

The non-greedy is actually acting as expected. This is because non-greedy 
operators are "forward looking", not "backward looking". So the non-greedy 
finds the start of the first start-of-the-match it comes across and then 
finds the first occurrence of '01' that makes the complete match, otherwise 
the greedy operator would match .* as much as it could, gobbling up all '01's 
before the last because these match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

My understanding is that backward looking operators are very resource 
expensive to implement.

James

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression question -- exclude substring

2005-11-07 Thread google
Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
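
The leading greedy (.*) is what does the work here: it swallows as much as it
can, so the second group can only start at the last "00" that still leaves a
match. A rough sketch in Python 3 syntax, using the example string from the
original question; the second pattern (a negative lookahead that forbids
another "00" inside the span) is an alternative added for comparison.

```python
import re

your_string = "00 noise1 01 noise2 00 target 01 target_mark"

# Greedy prefix: (.*) backs off only as far as needed, so group 2 begins at
# the last "00" that still allows "01 target_mark" to follow.
via_sub = re.sub(r"(.*)(00.*?01) target_mark", r"\2", your_string)
print(via_sub)

# Alternative: every character consumed after the opening "00" must not
# start another "00", which excludes that substring from the span.
tempered = re.compile(r"(00(?:(?!00).)*?01) target_mark")
print(tempered.findall(your_string))
```

The re.sub form replaces the whole matched text, so it returns the target
alone only when the match covers the entire string, as it does here.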

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression question -- exclude substring

2005-11-07 Thread Kent Johnson
[EMAIL PROTECTED] wrote:
> Hi,
> 
> I'm having trouble extracting substrings using regular expression. Here
> is my problem:
> 
> Want to find the substring that is immediately before a given
> substring. For example: from
> "00 noise1 01 noise2 00 target 01 target_mark",
> want to get
> "00 target 01"
> which is before
> "target_mark".
> My regular expression
> "(00.*?01) target_mark"
> will extract
> "00 noise1 01 noise2 00 target 01".

If there is a character that can't appear in the bit between the numbers then 
use everything-but-that instead of . - for example if spaces can only appear as 
you show them, use
"(00 [^ ]* 01) target_mark" or
"(00 \S* 01) target_mark"

Kent
-- 
http://mail.python.org/mailman/listinfo/python-list


Regular expression question -- exclude substring

2005-11-07 Thread dreamerbin
Hi,

I'm having trouble extracting substrings using regular expression. Here
is my problem:

Want to find the substring that is immediately before a given
substring. For example: from
"00 noise1 01 noise2 00 target 01 target_mark",
want to get
"00 target 01"
which is before
"target_mark".
My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".

I'm thinking that the solution to my problem might be to use a regular
expression to exclude the substring "target_mark", which will replace
the part of ".*" above. However, I don't know how to exclude a
substring. Can anyone help on this? Or maybe give another solution to
my problem? Thanks very much.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Hopefully simple regular expression question

2005-06-14 Thread [EMAIL PROTECTED]
Thank you! I had totally forgot about that. It works.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Hopefully simple regular expression question

2005-06-14 Thread TZOTZIOY
On 14 Jun 2005 04:01:58 -0700, rumours say that "[EMAIL PROTECTED]"
<[EMAIL PROTECTED]> might have written:

>I want to match a word against a string such that 'peter' is found in
>"peter bengtsson" or " hey peter," or but in "thepeter bengtsson" or
>"hey peterbe," because the word has to stand on its own. The following
>code works for a single word:

[snip]

use \b before and after the word you search, for example:

rePeter= re.compile(r"\bpeter\b", re.I)

In the documentation for the re module, Subsection 4.2.1 is Regular
Expression Syntax; it'll help a lot if you read it.

Cheers.
-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Hopefully simple regular expression question

2005-06-14 Thread Kalle Anke
On Tue, 14 Jun 2005 13:01:58 +0200, [EMAIL PROTECTED] wrote
(in article <[EMAIL PROTECTED]>):

> How do I modify my regular expression to match on expressions as well
> as just single words??

import re

def createStandaloneWordRegex(word):
    """ return a regular expression that can find 'peter' only if it's
    written alone (next to space, start of string, end of string,
    comma, etc) but not if inside another word like peterbe """

    return re.compile(r'\b' + word + r'\b', re.I)


def test_createStandaloneWordRegex():
    def T(word, text):
        print createStandaloneWordRegex(word).findall(text)

    T("peter", "So Peter Bengtsson wrote this")
    T("peter", "peter")
    T("peter bengtsson", "So Peter Bengtsson wrote this")
test_createStandaloneWordRegex()

Works?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Hopefully simple regular expression question

2005-06-14 Thread John Machin
[EMAIL PROTECTED] wrote:
> I want to match a word against a string such that 'peter' is found in
> "peter bengtsson" or " hey peter," or but in "thepeter bengtsson" or
> "hey peterbe," because the word has to stand on its own. The following
> code works for a single word:
> 
> def createStandaloneWordRegex(word):
> """ return a regular expression that can find 'peter' only if it's
> written
> alone (next to space, start of string, end of string, comma, etc)
> but
> not if inside another word like peterbe """
> return re.compile(r"""
>   (
>   ^ %s
>   (?=\W | $)
>   |
>   (?<=\W)
>   %s
>   (?=\W | $)
>   )
>   """% (word, word), re.I|re.L|re.M|re.X)
> 
> 
> def test_createStandaloneWordRegex():
> def T(word, text):
> print createStandaloneWordRegex(word).findall(text)
> 
> T("peter", "So Peter Bengtsson wrote this")
> T("peter", "peter")
> T("peter bengtsson", "So Peter Bengtsson wrote this")
> 
> The result of running this is::
> 
>  ['Peter']
>  ['peter']
>  []   <--- this is the problem!!
> 
> 
> It works if the parameter is just one word (eg. 'peter') but stops
> working when it's an expression (eg. 'peter bengtsson')

No, not when it's an "expression" (whatever that means), but when the 
parameter contains whitespace, which is ignored in verbose mode.

> 
> How do I modify my regular expression to match on expressions as well
> as just single words??
> 

If you must stick with re.X, you must escape any whitespace characters 
in your "word" -- see re.escape().

Alternatively (1), drop re.X but this is ugly:

regex_text_no_X = r"(^%s(?=\W|$)|(?<=\W)%s(?=\W|$))" % (word, word)

Alternatively (2), consider using the \b gadget; this appears to give 
the same answers as the baroque method:

regex_text_no_flab = r"\b%s\b" % word
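
A quick sketch of alternative (2) combined with re.escape(), in Python 3
syntax. The function name standalone_word_regex is made up for the example;
escaping the phrase is a precaution so that any characters in it that are
special to the re module are treated literally.

```python
import re


def standalone_word_regex(phrase):
    # \b is a zero-width word boundary; re.escape() makes the phrase literal.
    return re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)


text = "So Peter Bengtsson wrote this"
single = standalone_word_regex("peter").findall(text)
multi = standalone_word_regex("peter bengtsson").findall(text)
inside = standalone_word_regex("peter").findall("hey peterbe,")
print(single, multi, inside)
```

Since re.X is not used, the space in "peter bengtsson" is matched as an
ordinary character.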


HTH,
John



-- 
http://mail.python.org/mailman/listinfo/python-list


Hopefully simple regular expression question

2005-06-14 Thread [EMAIL PROTECTED]
I want to match a word against a string such that 'peter' is found in
"peter bengtsson" or " hey peter," but not in "thepeter bengtsson" or
"hey peterbe," because the word has to stand on its own. The following
code works for a single word:

def createStandaloneWordRegex(word):
    """ return a regular expression that can find 'peter' only if it's
    written alone (next to space, start of string, end of string, comma,
    etc) but not if inside another word like peterbe """
    return re.compile(r"""
      (
      ^ %s
      (?=\W | $)
      |
      (?<=\W)
      %s
      (?=\W | $)
      )
      """% (word, word), re.I|re.L|re.M|re.X)


def test_createStandaloneWordRegex():
    def T(word, text):
        print createStandaloneWordRegex(word).findall(text)

    T("peter", "So Peter Bengtsson wrote this")
    T("peter", "peter")
    T("peter bengtsson", "So Peter Bengtsson wrote this")

The result of running this is::

 ['Peter']
 ['peter']
 []   <--- this is the problem!!


It works if the parameter is just one word (eg. 'peter') but stops
working when it's an expression (eg. 'peter bengtsson')

How do I modify my regular expression to match on expressions as well
as just single words??

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regular expression question

2005-02-14 Thread Fredrik Lundh
Bruno Desthuilliers wrote:

>> match = STX + '(.*)' + ETX
>>
>> # Example 1
>> # This appears to work, but I'm not sure if the '+' is being used in
>> the regular expression, or if it's just joining STX, '(.*)', and ETX.
>>
>> if re.search(STX + '(.*)' + ETX,data):
>>   print "Matches"
>>
>> # Example 2
>> # This also appears to work
>> if re.search(match,data):
>>   print "Matches"

> You may want something like:
> if re.search('%s(.*)%s' % (STX, ETX), data):
>   ...

that's of course the same thing as examples 1 and 2.

a tip to the original poster: if you're not sure what an expression does,
try printing the result.  use "print repr(v)" if the value may contain odd
characters.  try adding this to your test script:

print repr(match)
print repr(STX + '(.*)' + ETX)
print repr('%s(.*)%s' % (STX, ETX))
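
For reference, a self-contained version of that check in Python 3 syntax.
The control characters are written out with chr() instead of being imported
from curses.ascii (STX is 2, ETX is 3, FS is 28); the names by_concat and
by_format are only for illustration.

```python
import re

STX, ETX, FS = chr(2), chr(3), chr(28)
data = STX + "ONE" + FS + "TWO" + FS + "THREE" + ETX

# Both spellings build the same pattern string before re ever sees it.
by_concat = STX + "(.*)" + ETX
by_format = "%s(.*)%s" % (STX, ETX)
print(repr(by_concat))
print(by_concat == by_format)

m = re.search(by_concat, data)
print(repr(m.group(1)))

# The letters S, T, X inside a string literal are just letters.
print(re.search("STX(.*)ETX", data))
```

The '+' is ordinary string concatenation that runs before the pattern is
compiled, so it never becomes part of the regular expression.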

 



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regular expression question

2005-02-14 Thread snacktime
> You may want something like:
> if re.search('%s(.*)%s' % (STX, ETX), data):
>

Ah I didn't even think about that...

Chris
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regular expression question

2005-02-14 Thread Bruno Desthuilliers
snacktime wrote:
> The primary question is how do I perform a match when the regular
> expression contains string variables?  For example, in the following
> code I want to match a line that starts with STX, then has any number
> of characters, then ends with ETX.
> Example 2 I'm pretty sure works as I expect, but I'm not sure about
> Example 1, and I'm pretty sure about example 3.
>
> import re
> from curses.ascii import STX,ETX,FS
> STX =  chr(STX)
> ETX =  chr(ETX)
> FS =  chr(FS)
> data = STX + "ONE" + FS + "TWO" + FS + "THREE" + ETX
> match = STX + '(.*)' + ETX
>
> # Example 1
> # This appears to work, but I'm not sure if the '+' is being used in
> # the regular expression, or if it's just joining STX, '(.*)', and ETX.
>
> if re.search(STX + '(.*)' + ETX,data):
>   print "Matches"
>
> # Example 2
> # This also appears to work
> if re.search(match,data):
>   print "Matches"
>
> # Example 3
> # Doesn't work, as STX and ETX are evaluated as the literal strings
> # 'STX' and 'ETX'
> if re.search('STX(.*)ETX', data):
>   print "Matches"

You may want something like:
if re.search('%s(.*)%s' % (STX, ETX), data):
  ...

BTW, given your requirements, I'd write this:
if re.search('^%s(.*)%s$' % (STX, ETX), data):
  ...

> Chris
--
http://mail.python.org/mailman/listinfo/python-list


regular expression question

2005-02-14 Thread snacktime
The primary question is how do I perform a match when the regular
expression contains string variables?  For example, in the following
code I want to match a line that starts with STX, then has any number
of characters, then ends with ETX.
Example 2 I'm pretty sure works as I expect, but I'm not sure about
Example 1, and I'm pretty sure about example 3.

import re
from curses.ascii import STX,ETX,FS
STX =  chr(STX)
ETX =  chr(ETX)
FS =  chr(FS)
data = STX + "ONE" + FS + "TWO" + FS + "THREE" + ETX
match = STX + '(.*)' + ETX

# Example 1
# This appears to work, but I'm not sure if the '+' is being used in
# the regular expression, or if it's just joining STX, '(.*)', and ETX.

if re.search(STX + '(.*)' + ETX,data):
  print "Matches"

# Example 2
# This also appears to work
if re.search(match,data):
  print "Matches"

# Example 3
# Doesn't work, as STX and ETX are evaluated as the literal strings
# 'STX' and 'ETX'
if re.search('STX(.*)ETX', data):
  print "Matches"

Chris
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple (newbie) regular expression question

2005-01-21 Thread André Roberge
John Machin wrote:
> André Roberge wrote:
> > Sorry for the simple question, but I find regular
> > expressions rather intimidating.  And I've never
> > needed them before ...
> >
> > How would I go about to 'define' a regular expression that
> > would identify strings like
> > __alphanumerical__  as in __init__
> > (Just to spell things out, as I have seen underscores disappear
> > from messages before, that's  2 underscores immediately
> > followed by an alphanumerical string immediately followed
> > by 2 underscore; in other words, a python 'private' method).
> >
> > Simple one-liner would be good.
> > One-liner with explanation would be better.
> >
> > One-liner with explanation, and pointer to 'great tutorial'
> > (for future reference) would probably be ideal.
> > (I know, google is my friend for that last part. :-)
> >
> > Andre
>
> Firstly, some corrections: (1) google is your friend for _all_ parts of
> your question (2) Python has an initial P and doesn't have private
> methods.
>
> Read this:
>
> >>> pat1 = r'__[A-Za-z0-9_]*__'
> >>> pat2 = r'__\w*__'
> >>> import re
> >>> tests = ['x', '__', '____', '_____', '__!__', '__a__', '__Z__',
> ... '__8__', '__xyzzy__', '__plugh']
> >>> [x for x in tests if re.search(pat1, x)]
> ['____', '_____', '__a__', '__Z__', '__8__', '__xyzzy__']
> >>> [x for x in tests if re.search(pat2, x)]
> ['____', '_____', '__a__', '__Z__', '__8__', '__xyzzy__']
>
> I've interpreted your question as meaning "valid Python identifier that
> starts and ends with two [implicitly, or more] underscores".
>
> In the two alternative patterns, the part in the middle says "zero or
> more instances of a character that can appear in the middle of a Python
> identifier". The first pattern spells this out as "capital letters,
> small letters, digits, and underscore". The second pattern uses the \w
> shorthand to give the same effect.
> You should be able to follow that from the Python documentation.
> Now, read this: http://www.amk.ca/python/howto/regex/
>
> HTH,
> John

Thanks for it all. It does help!
André
--
http://mail.python.org/mailman/listinfo/python-list


Re: Simple (newbie) regular expression question

2005-01-21 Thread John Machin

André Roberge wrote:
> Sorry for the simple question, but I find regular
> expressions rather intimidating.  And I've never
> needed them before ...
>
> How would I go about to 'define' a regular expression that
> would identify strings like
> __alphanumerical__  as in __init__
> (Just to spell things out, as I have seen underscores disappear
> from messages before, that's  2 underscores immediately
> followed by an alphanumerical string immediately followed
> by 2 underscore; in other words, a python 'private' method).
>
> Simple one-liner would be good.
> One-liner with explanation would be better.
>
> One-liner with explanation, and pointer to 'great tutorial'
> (for future reference) would probably be ideal.
> (I know, google is my friend for that last part. :-)
>
> Andre

Firstly, some corrections: (1) google is your friend for _all_ parts of
your question (2) Python has an initial P and doesn't have private
methods.

Read this:

>>> pat1 = r'__[A-Za-z0-9_]*__'
>>> pat2 = r'__\w*__'
>>> import re
>>> tests = ['x', '__', '____', '_____', '__!__', '__a__', '__Z__',
... '__8__', '__xyzzy__', '__plugh']
>>> [x for x in tests if re.search(pat1, x)]
['____', '_____', '__a__', '__Z__', '__8__', '__xyzzy__']
>>> [x for x in tests if re.search(pat2, x)]
['____', '_____', '__a__', '__Z__', '__8__', '__xyzzy__']
>>>

I've interpreted your question as meaning "valid Python identifier that
starts and ends with two [implicitly, or more] underscores".

In the two alternative patterns, the part in the middle says "zero or
more instances of a character that can appear in the middle of a Python
identifier". The first pattern spells this out as "capital letters,
small letters, digits, and underscore". The second pattern uses the \w
shorthand to give the same effect.
You should be able to follow that from the Python documentation.
Now, read this: http://www.amk.ca/python/howto/regex/
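
One caveat about the patterns above: re.search() accepts a match anywhere in
the string, and \w* also accepts an empty middle. If the whole string has to
be a name of that shape, a stricter variant might look like the sketch below
(Python 3 syntax; re.fullmatch and the \w+ requirement are assumptions about
what is wanted, not part of the answer above, and the test strings are made
up).

```python
import re

tests = ["x", "__", "____", "__!__", "__a__", "__init__", "__plugh",
         "my__init__x"]

# Loose check, as above: a match anywhere in the string, middle may be empty.
loose = [t for t in tests if re.search(r"__\w*__", t)]

# Stricter check: the entire string must match, with at least one character
# between the two pairs of underscores.
dunder = re.compile(r"__\w+__")
strict = [t for t in tests if dunder.fullmatch(t)]

print(loose)
print(strict)
```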

HTH,

John

--
http://mail.python.org/mailman/listinfo/python-list


Simple (newbie) regular expression question

2005-01-21 Thread André Roberge
Sorry for the simple question, but I find regular
expressions rather intimidating.  And I've never
needed them before ...
How would I go about to 'define' a regular expression that
would identify strings like
__alphanumerical__  as in __init__
(Just to spell things out, as I have seen underscores disappear
from messages before, that's  2 underscores immediately
followed by an alphanumerical string immediately followed
by 2 underscore; in other words, a python 'private' method).
Simple one-liner would be good.
One-liner with explanation would be better.
One-liner with explanation, and pointer to 'great tutorial'
(for future reference) would probably be ideal.
(I know, google is my friend for that last part. :-)
Andre
--
http://mail.python.org/mailman/listinfo/python-list


Re: OT: novice regular expression question

2004-12-31 Thread It's me
Oops!

Sorry, didn't realize that.

Thanks,

"M.E.Farmer" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
>
> It's me wrote:
> > The shlex.py needs quite a number of .py files.  I tried to hunt down a few
> > of them and got really tired.
> >
> > Is there one batch of .py files that I can download from somewhere?
> >
> > Thanks,
> Not sure what you mean by this.
> Shlex is a standard library module.
> It imports os and sys only, they are standard library modules.
> If you have python you have them already.
> If you mean cStringIO it is in the standard library(at least on my
> system).
> You dont have to use it just feed shlex an open file.
> py>lexer = shlex.shlex(open('myrecord.txt', 'r'))
>
> Hth,
> M.E.Farmer
>


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: OT: novice regular expression question

2004-12-30 Thread M.E.Farmer

It's me wrote:
> The shlex.py needs quite a number of .py files.  I tried to hunt down a few
> of them and got really tired.
>
> Is there one batch of .py files that I can download from somewhere?
>
> Thanks,
Not sure what you mean by this.
Shlex is a standard library module.
It imports os and sys only, they are standard library modules.
If you have python you have them already.
If you mean cStringIO it is in the standard library(at least on my
system).
You dont have to use it just feed shlex an open file.
py>lexer = shlex.shlex(open('myrecord.txt', 'r'))

Hth,
M.E.Farmer

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: OT: novice regular expression question

2004-12-30 Thread It's me
The shlex.py needs quite a number of .py files.  I tried to hunt down a few
of them and got really tired.

Is there one batch of .py files that I can download from somewhere?

Thanks,


"M.E.Farmer" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Hello me,
> Have you tried shlex.py it is a tokenizer for writing lexical
> parsers.
> Should be a breeze to whip something up with it.
> an example of tokenizing:
> py>import shlex
> py># fake an open record
> py>import cStringIO
> py>myfakeRecord = cStringIO.StringIO()
> py>myfakeRecord.write("['1','2'] \n 'fdfdfdfd' \n 'dfdfdfdfd' ['1','2']\n")
> py>myfakeRecord.seek(0)
> py>lexer = shlex.shlex(myfakeRecord)
>
> py>lexer.get_token()
> '['
> py>lexer.get_token()
> '1'
> py>lexer.get_token()
> ','
> py>lexer.get_token()
> '2'
> py>lexer.get_token()
> ']'
> py>lexer.get_token()
> 'fdfdfdfd'
>
> You can do a lot with it that is just a teaser.
> M.E.Farmer
>


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: OT: novice regular expression question

2004-12-30 Thread M.E.Farmer
Hello me,
Have you tried shlex.py it is a tokenizer for writing lexical
parsers.
Should be a breeze to whip something up with it.
an example of tokenizing:
py>import shlex
py># fake an open record
py>import cStringIO
py>myfakeRecord = cStringIO.StringIO()
py>myfakeRecord.write("['1','2'] \n 'fdfdfdfd' \n 'dfdfdfdfd' ['1','2']\n")
py>myfakeRecord.seek(0)
py>lexer = shlex.shlex(myfakeRecord)

py>lexer.get_token()
'['
py>lexer.get_token()
'1'
py>lexer.get_token()
','
py>lexer.get_token()
'2'
py>lexer.get_token()
']'
py>lexer.get_token()
'fdfdfdfd'

You can do a lot with it that is just a teaser.
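
The same idea as a self-contained sketch in Python 3 syntax, where
io.StringIO takes the place of cStringIO. Passing posix=True is an assumption
made for this example: in that mode shlex strips the quote characters from
quoted tokens, whereas the default mode keeps them as part of the token.

```python
import io
import shlex

# Fake an open record, as in the session above.
record = io.StringIO("['1','2'] \n 'fdfdfdfd' \n 'dfdfdfdfd' ['1','2']\n")

# posix=True (an assumption for this sketch) removes the surrounding quotes.
lexer = shlex.shlex(record, posix=True)

# A shlex object is iterable and stops at end of input.
tokens = list(lexer)
print(tokens[:6])
```

Brackets and commas come back as one-character tokens, so grouping the items
between '[' and ']' into a list is left to the calling code.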
M.E.Farmer

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: OT: novice regular expression question

2004-12-30 Thread It's me
I'll chew on this.  Thanks, got to go.


"Steve Holden" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> It's me wrote:
>
> > I am never very good with regular expressions.  My head always hurts
> > whenever I need to use it.
> >
> Well, they are a pain to more than just you, and the conventional advice
> is "even when you are convinced you need to use REs, try and find
> another way".
>
> > I need to read a data file and parse each data record.  Each item on the
> > data record begins with either a string, or a list of strings.  I
searched
> > around and didn't see any existing Python packages that does that.
> > scanf.py, for instance, can do standard items but doesn't know about
list.
> > So, I figure I might have to write a lex engine for it and of course I
have
> > to deal wit RE again.
> >
> Well, you haven't yet convinced me that you *have* to. Personally, I
> think you just like trouble :-)
>
> > But I run into a problem right from the start.   To recognize a list, I need a
> > RE for the string:
> >
> > 1) begin with ["  (left bracket followed by a double quote with zero or more
> > spaces in between)
> > 2) followed by any characters until ] but only if that right bracket is not
> > preceded by the escape character \.
> >
> So the pattern is
>
> 1. If the line begins with a "[" it should end with a "]"
>
> 2. Otherwise, it shouldn't?
>
> I'm trying to gently point out that the syntax you want to accept isn't
> actually very clear. If the format is "Python strings and lists of
> strings" then you might want to use the Python lexer to parse them, but
> that's quite an advanced topic. [too advanced for me :-]
>
> The problem is matching "up to a right bracket not preceded by a
> backslash". This seems to require what's technically referred to as a
> "negative lookbehind assertion" - in other words, a pattern that doesn't
> match anything, but checks that a specific condition is false or fails.
>
> > So, I tried:
> >
> > ^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]
> >
> > and tested with:
> >
> > ["This line\] works"]
> >
> > but it fails with:
> >
> > ["This line fails"]
> >
> > I would have thought that:
> >
> >(\\\])*
> >
> > should work because it's zero or more occurrences of the pattern \]
> >
> > Any help is greatly appreciated.
> >
> > Sorry for being OT.  I posted this question at the lex group and didn't get
> > any response.  I figure maybe somebody would know around here.
>
> I'd start with baby steps. First of all, make sure that you can match
> the individual strings. Then use that pattern, parenthesized to turn it
> into a group, as a component in a more complex pattern.
>
> Do you want to treat "this is also \" a string" as an allowable string?
> In that case you need a pattern that matches "up to the first quotation
> mark not preceded by a backslash" as well!
>
> Let's try matching a single string first:
>
>   >>> s = re.compile(r'(".*?(?<!\\)")')
>   >>> s.match('"s1", "s2"').groups()
> ('"s1"',)
>
> Note that I followed the "*" with a "?" to stop it being greedy, and
> matching as many characters as it could. OK, does that work when we have
> escaped quotation marks?
>
>   >>> s.match(r'"s1\"\"", "s2"').groups()
> ('"s1\\"\\""',)
>
> Apparently so. The negative lookbehind assertion stops a quote from
> matching when it's preceded by a backslash. Can we match a
> comma-separated list of such strings?
>
>   >>> slpat = r'(".*?(?<!\\)")(?:, (".*?(?<!\\)"))*'
>   >>> s = re.compile(slpat)
>
> This is a bit trickier: here the second grouping beginning with "(?:" is
> intended to ensure that only the strings that get matched are included
> in the groups, not the separators, even though they must be grouped
> together. The list *must* be separated by ", ", but you could alter the
> pattern to allow zero or more whitespace characters.
>
>   >>> s.match(r'"s1\"\"", "s2"').groups()
> ('"s1\\"\\""', '"s2"')
>
> Well, that seems to work. Note that these patterns all ignore bracket
> characters, so all you need to do now is to surround them with patterns
> to match the opening and closing brackets, and you're done (I hope).
>
> Anyway, it'll give you a few ideas to work with.
>
> regards
>   Steve
> -- 
> Steve Holden   http://www.holdenweb.com/
> Python Web Programming  http://pydish.holdenweb.com/
> Holden Web LLC  +1 703 861 4237  +1 800 494 3119


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: OT: novice regular expression question

2004-12-30 Thread RyanMorillo
Check jgsoft dot com; they have two things which may help: EditPad Pro
(the test version has a good tutorial) or PowerGREP (if you do a lot
of regexes). There is also the Mastering Regular Expressions book from
O'Reilly (if you do a lot of regex work).

Also, the Perl group would be good for regexes (Python's are largely
Perl 5 compatible).

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: OT: novice regular expression question

2004-12-30 Thread Steve Holden
It's me wrote:

> I am never very good with regular expressions.  My head always hurts
> whenever I need to use it.

Well, they are a pain to more than just you, and the conventional advice
is "even when you are convinced you need to use REs, try and find
another way".

> I need to read a data file and parse each data record.  Each item on the
> data record begins with either a string, or a list of strings.  I searched
> around and didn't see any existing Python packages that do that.
> scanf.py, for instance, can do standard items but doesn't know about lists.
> So, I figure I might have to write a lex engine for it and of course I have
> to deal with RE again.

Well, you haven't yet convinced me that you *have* to. Personally, I
think you just like trouble :-)

> But I run into a problem right from the start.   To recognize a list, I need a
> RE for the string:
>
> 1) begin with ["  (left bracket followed by a double quote with zero or more
> spaces in between)
> 2) followed by any characters until ] but only if that right bracket is not
> preceded by the escape character \.

So the pattern is

1. If the line begins with a "[" it should end with a "]"

2. Otherwise, it shouldn't?

I'm trying to gently point out that the syntax you want to accept isn't
actually very clear. If the format is "Python strings and lists of
strings" then you might want to use the Python lexer to parse them, but
that's quite an advanced topic. [too advanced for me :-]

The problem is matching "up to a right bracket not preceded by a 
backslash". This seems to require what's technically referred to as a 
"negative lookbehind assertion" - in other words, a pattern that doesn't 
match anything, but checks that a specific condition is false or fails.
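
A minimal sketch of that idea, assuming the goal is simply "everything from
the opening bracket to the first closing bracket that is not escaped". The
names `closing`, `plain` and `escaped` are made up for illustration, and the
pattern does not try to cope with an escaped backslash sitting directly
before the bracket.

```python
import re

# "\[" matches the opening bracket, ".*?" lazily consumes characters, and
# "(?<!\\)\]" accepts a closing bracket only when the character just
# before it is not a backslash. The lookbehind itself consumes nothing.
closing = re.compile(r'\[.*?(?<!\\)\]')

plain = '["This line fails"] trailing'
escaped = r'["This line\] works"] trailing'

plain_match = closing.match(plain).group()
escaped_match = closing.match(escaped).group()

print(plain_match)    # the list part only, without " trailing"
print(escaped_match)  # runs past the escaped "\]" to the real closing bracket
```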

> So, I tried:
>
> ^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]
>
> and tested with:
>
> ["This line\] works"]
>
> but it fails with:
>
> ["This line fails"]
>
> I would have thought that:
>
>    (\\\])*
>
> should work because it's zero or more occurrences of the pattern \]
>
> Any help is greatly appreciated.
>
> Sorry for being OT.  I posted this question at the lex group and didn't get
> any response.  I figure maybe somebody would know around here.

I'd start with baby steps. First of all, make sure that you can match
the individual strings. Then use that pattern, parenthesized to turn it
into a group, as a component in a more complex pattern.

Do you want to treat "this is also \" a string" as an allowable string?
In that case you need a pattern that matches "up to the first quotation
mark not preceded by a backslash" as well!

Let's try matching a single string first:
 >>> import re
 >>> s = re.compile(r'(".*?(?<!\\)")')
 >>> s.match('"s1", "s2"').groups()
('"s1"',)

Note that I followed the "*" with a "?" to stop it being greedy, and 
matching as many characters as it could. OK, does that work when we have 
escaped quotation marks?

 >>> s.match(r'"s1\"\"", "s2"').groups()
('"s1\\"\\""',)
Apparently so. The negative lookbehind assertion stops a quote from 
matching when it's preceded by a backslash. Can we match a 
comma-separated list of such strings?

 >>> slpat = r'(".*?(?<!\\)")(?:, (".*?(?<!\\)"))*'
 >>> s = re.compile(slpat)

This is a bit trickier: here the second grouping beginning with "(?:" is 
intended to ensure that only the strings that get matched are included 
in the groups, not the separators, even though they must be grouped 
together. The list *must* be separated by ", ", but you could alter the 
pattern to allow zero or more whitespace characters.

 >>> s.match(r'"s1\"\"", "s2"').groups()
('"s1\\"\\""', '"s2"')
Well, that seems to work. Note that these patterns all ignore bracket 
characters, so all you need to do now is to surround them with patterns 
to match the opening and closing brackets, and you're done (I hope).
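
One way the bracket-wrapping step might look, as a rough sketch rather than
a finished parser. The names `STRING`, `LIST` and `parse_list` are
hypothetical, and allowing optional whitespace around the commas and
brackets is an assumption about the data format. Because a repeated
capturing group only remembers its last repetition, the sketch captures the
whole list body once and then pulls the individual strings out with
re.findall.

```python
import re

# One double-quoted string: the closing quote must not follow a backslash.
STRING = r'".*?(?<!\\)"'

# A bracketed, comma-separated list of such strings. The single capturing
# group holds the whole list body; the inner groups are non-capturing.
LIST = re.compile(
    r'\[\s*(' + STRING + r'(?:\s*,\s*' + STRING + r')*)\s*\]'
)


def parse_list(text):
    """Return the quoted strings of a list at the start of text, or None."""
    m = LIST.match(text)
    if m is None:
        return None
    # Re-scan the captured body so that every string is returned,
    # not just the first and the last.
    return re.findall(STRING, m.group(1))


print(parse_list('["s1", "s2", "s3"]'))
print(parse_list(r'["This line\] works"]'))
print(parse_list('no list here'))
```

The returned strings still carry their quotes and backslashes, so
unescaping them is a separate step.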

Anyway, it'll give you a few ideas to work with.
regards
 Steve
--
Steve Holden   http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC  +1 703 861 4237  +1 800 494 3119
--
http://mail.python.org/mailman/listinfo/python-list


OT: novice regular expression question

2004-12-30 Thread It's me
I am never very good with regular expressions.  My head always hurts
whenever I need to use it.

I need to read a data file and parse each data record.  Each item on the
data record begins with either a string, or a list of strings.  I searched
around and didn't see any existing Python packages that do that.
scanf.py, for instance, can do standard items but doesn't know about lists.
So, I figure I might have to write a lex engine for it and of course I have
to deal with RE again.

But I run into a problem right from the start.   To recognize a list, I need a
RE for the string:

1) begin with ["  (left bracket followed by a double quote with zero or more
spaces in between)
2) followed by any characters until ] but only if that right bracket is not
preceded by the escape character \.

So, I tried:

^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]

and tested with:

["This line\] works"]

but it fails with:

["This line fails"]

I would have thought that:

   (\\\])*

should work because it's zero or more occurrences of the pattern \]

Any help is greatly appreciated.

Sorry for being OT.  I posted this question at the lex group and didn't get
any response.  I figure maybe somebody would know around here.


-- 
http://mail.python.org/mailman/listinfo/python-list