Re: A regular expression question
Cpcp Cp writes:

> Look this
>
> >>> import re
> >>> text="asdfnbd]"
> >>> m=re.sub("n*?","?",text)
> >>> print m
> ?a?s?d?f?n?b?d?]?
>
> I don't understand the 'non-greedy' pattern.

Since ‘n*’ matches zero or more ‘n’s, it matches zero adjacent to every actual character. It's non-greedy because it matches as few characters as will allow the match to succeed.

> I think the repl argument should replaces every char in text and
> outputs "".

I hope that helps you understand why that expectation is wrong :-)

Regular expression patterns are *not* an easy topic. Try experimenting and learning with <http://www.regexr.com/>.

-- 
“If I haven't seen as far as others, it is because giants were standing on my shoulders.” (Hal Abelson)
Ben Finney
-- https://mail.python.org/mailman/listinfo/python-list
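To make the zero-width matching concrete, here is a minimal sketch in Python 3 syntax (the thread itself uses Python 2). The sample strings other than `text` are assumptions chosen for illustration. The exact output of `re.sub` with a pattern that can match the empty string can differ between Python versions, so the sketch only relies on results that do not depend on that.

```python
import re

text = "asdfnbd]"

# Greedy: n* takes as many 'n' characters as it can.
greedy = re.match("n*", "nnnb").group()

# Non-greedy: n*? takes as few as it can, which is zero characters.
lazy = re.match("n*?", "nnnb").group()

# The non-greedy form only grows when the rest of the pattern forces it to.
forced = re.match("n*?b", "nnnb").group()

# Patterns that cannot match the empty string behave the way the poster expected.
one_or_more = re.sub("n+", "?", text)
every_char = re.sub(".", "?", text)

print(repr(greedy), repr(lazy), repr(forced), one_or_more, every_char)
```

Because `n*?` is satisfied by the empty string at every position, the substitution in the question inserts a `?` between characters instead of replacing them.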
A regular expression question
Look at this:

>>> import re
>>> text="asdfnbd]"
>>> m=re.sub("n*?","?",text)
>>> print m
?a?s?d?f?n?b?d?]?

I don't understand the 'non-greedy' pattern. I think the repl argument should replace every char in text and output "".
Re: regular expression question (re module)
On Oct 16, 2008, at 11:25 PM, Steve Holden wrote: Pat wrote: Faheem Mitha wrote: Hi, I need to match a string of the form capital_letter underscore capital_letter number against a string of the form anything capital_letter underscore capital_letter number some_stuff_not_starting with a number DUKE1_plateD_A12.CEL. Thanks in advance. Please cc me with any reply. Faheem. While I can't provide you with an answer, I can say that I've been using RegExBuddy (for Windows, about $40, 90 day money back guarantee, http://www.regexbuddy.com/) for quite a few months now and it's greatly helped me with creating/learning/debugging regexps. You put in your regexp in the top field and all the possibilities in the bottom field. Whatever matches is instantly highlighted. You keep modifying your RE until only the correct matches are highlighted. Talk about instant gratification! No, I'm in no way affiliated with this company. There's also a free *IX version that's quite similar to RegExBuddy but I don't have the name since I'm writing this while on a Windows platform. -- http://mail.python.org/mailman/listinfo/python-list Or you could use the Kodos tool, written in Python and well worth a trial since it's free. Google is, as always, your friend in locating it. I use this one as my regex playground: http://cthedot.de/retest/ -- http://mail.python.org/mailman/listinfo/python-list
Re: regular expression question (re module)
Pat wrote: > Faheem Mitha wrote: >> Hi, >> >> I need to match a string of the form >> >> capital_letter underscore capital_letter number >> >> against a string of the form >> >> anything capital_letter underscore capital_letter number >> some_stuff_not_starting with a number >> > >> DUKE1_plateD_A12.CEL. >> >> Thanks in advance. Please cc me with any reply. >> Faheem. >> > > While I can't provide you with an answer, I can say that I've been using > RegExBuddy (for Windows, about $40, 90 day money back guarantee, > http://www.regexbuddy.com/) for quite a few months now and it's greatly > helped me with creating/learning/debugging regexps. You put in your > regexp in the top field and all the possibilities in the bottom field. > Whatever matches is instantly highlighted. You keep modifying your RE > until only the correct matches are highlighted. Talk about instant > gratification! No, I'm in no way affiliated with this company. > > There's also a free *IX version that's quite similar to RegExBuddy but I > don't have the name since I'm writing this while on a Windows platform. > -- > http://mail.python.org/mailman/listinfo/python-list > Or you could use the Kodos tool, written in Python and well worth a trial since it's free. Google is, as always, your friend in locating it. regards Steve -- Steve Holden+1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ -- http://mail.python.org/mailman/listinfo/python-list
Re: regular expression question (re module)
Faheem Mitha wrote: Hi, I need to match a string of the form capital_letter underscore capital_letter number against a string of the form anything capital_letter underscore capital_letter number some_stuff_not_starting with a number DUKE1_plateD_A12.CEL. Thanks in advance. Please cc me with any reply. Faheem. While I can't provide you with an answer, I can say that I've been using RegExBuddy (for Windows, about $40, 90 day money back guarantee, http://www.regexbuddy.com/) for quite a few months now and it's greatly helped me with creating/learning/debugging regexps. You put in your regexp in the top field and all the possibilities in the bottom field. Whatever matches is instantly highlighted. You keep modifying your RE until only the correct matches are highlighted. Talk about instant gratification! No, I'm in no way affiliated with this company. There's also a free *IX version that's quite similar to RegExBuddy but I don't have the name since I'm writing this while on a Windows platform. -- http://mail.python.org/mailman/listinfo/python-list
Re: regular expression question (re module)
Faheem Mitha: > I need to match a string of the form > ... Please, show the code you have written so far, with your input-output examples included (as doctests, for example), and we can try to find ways to help you remove the bugs you have. Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list
regular expression question (re module)
Hi,

I need to match a string of the form

    capital_letter underscore capital_letter number

against a string of the form

    anything capital_letter underscore capital_letter number some_stuff_not_starting with a number

E.g. D_A1 needs to match with DUKE1_plateD_A1.CEL, but not any of DUKE1_plateD_A10.CEL, Duke1_PlateD_A11v2.CEL, DUKE1_plateD_A12.CEL.

Similarly D_A10 needs to match DUKE1_plateD_A10.CEL, but not any of DUKE1_plateD_A1.CEL, Duke1_PlateD_A11v2.CEL, DUKE1_plateD_A12.CEL.

Similarly D_A11 needs to match Duke1_PlateD_A11v2.CEL, but not any of DUKE1_plateD_A1.CEL, DUKE1_plateD_A10.CEL, DUKE1_plateD_A12.CEL.

Thanks in advance. Please cc me with any reply.

Faheem.
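No regex was posted in reply to this question, so the following is only one possible approach, sketched in Python 3 syntax: escape the key and forbid a digit immediately after it with a negative lookahead. The helper name `key_in_filename` is hypothetical; the file names are the ones from the question.

```python
import re

def key_in_filename(key, filename):
    """Return True if key occurs in filename and is not followed by a digit."""
    pattern = re.escape(key) + r"(?!\d)"
    return re.search(pattern, filename) is not None

files = [
    "DUKE1_plateD_A1.CEL",
    "DUKE1_plateD_A10.CEL",
    "Duke1_PlateD_A11v2.CEL",
    "DUKE1_plateD_A12.CEL",
]

# For each key, list the file names it is allowed to match.
matches = {key: [f for f in files if key_in_filename(key, f)]
           for key in ("D_A1", "D_A10", "D_A11")}
for key, found in matches.items():
    print(key, found)
```

The lookahead checks the next character without consuming it, so `D_A1` is rejected in `D_A10` but accepted in `D_A1.CEL`.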
Re: Regular Expression question
On Oct 25, 9:25 am, Peter Otten <[EMAIL PROTECTED]> wrote:
>
> You want a "negative lookahead assertion" then:
>

Now I feel dumb... I've seen (?!...) a dozen times in the docs but never figured out that it is what I'm looking for.

So this one is the winner:

    s = re.search(r'create\s+or\s+replace\s+package\s+(?!body\s+)', txt, re.IGNORECASE)

Thanks Peter and Marc.
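A small sketch of the winning pattern at work, in Python 3 syntax. The sample text follows the one given in the original question; the helper `line_of` and the `strict` variant are assumptions added for illustration and are not from the thread. The variant is there because `\s+` can backtrack: with more than one space before "body", the posted pattern can still match.

```python
import re

txt = """\
create or replace package XXX
...

create or replace package body XXX
...
"""

spec = re.compile(r'create\s+or\s+replace\s+package\s+(?!body\s+)', re.IGNORECASE)
body = re.compile(r'create\s+or\s+replace\s+package\s+body\s+', re.IGNORECASE)

def line_of(match, text):
    """1-based line number on which a match starts."""
    return text.count("\n", 0, match.start()) + 1

spec_line = line_of(spec.search(txt), txt)
body_line = line_of(body.search(txt), txt)
print(spec_line, body_line)

# Hypothetical stricter variant: test the lookahead before consuming whitespace,
# so extra spaces before "body" cannot be used to slip past it.
strict = re.compile(r'create\s+or\s+replace\s+package(?!\s+body\b)\s+', re.IGNORECASE)
tricky = "create or replace package   body XXX"
loose_hit = spec.search(tricky)
strict_hit = strict.search(tricky)
```

On input with single spaces the two patterns agree; the difference only shows up on input like `tricky`.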
Re: Regular Expression question
looping wrote:

> On Oct 25, 8:49 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
>>
>> needle = re.compile(r'create\s+or\s+replace\s+package(\s+body)?\s+',
>>                     re.IGNORECASE)
>
> What I want here is a RE that return ONLY the line without the "body"
> keyword.
> Your RE return both.
> I know I could use it but I want to learn how to search something that
> is NOT in the string using RE.

You want a "negative lookahead assertion" then:

>>> import re
>>> s = """Isaac Newton
... Isaac Asimov
... Isaac Singer
... """
>>> re.compile("Isaac (?!Asimov).*").findall(s)
['Isaac Newton', 'Isaac Singer']

Peter
Re: Regular Expression question
On Oct 25, 8:49 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote: > > needle = re.compile(r'create\s+or\s+replace\s+package(\s+body)?\s+', > re.IGNORECASE) What I want here is a RE that return ONLY the line without the "body" keyword. Your RE return both. I know I could use it but I want to learn how to search something that is NOT in the string using RE. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
On Thu, 25 Oct 2007 06:34:03 +, looping wrote:

> Hi,
> It's not really a Python question but I'm sure someone could help me.
>
> When I use RE, I always have trouble with this kind of search:
>
> Ex.
>
> I've a text file:
> """
> create or replace package XXX
> ...
>
> create or replace package body XXX
> ...
> """
> now I want to search the position (line) of this two string.
>
> for the body I use:
> s = re.search(r'create\s+or\s+replace\s+package\s+body\s+', txt,
> re.IGNORECASE)
>
> but how to search for the other line ?
> I want the same RE but explicitly without "body".

Then write the same RE but explicitly without "body". But I guess I didn't understand your problem when the answer is that obvious.

Maybe you want to iterate over the text file line by line and match or search within the line? Untested:

    needle = re.compile(r'create\s+or\s+replace\s+package(\s+body)?\s+',
                        re.IGNORECASE)
    for i, line in enumerate(lines):
        if needle.match(line):
            print 'match in line %d' % (i + 1)

Ciao,
    Marc 'BlackJack' Rintsch
Regular Expression question
Hi, It's not really a Python question but I'm sure someone could help me. When I use RE, I always have trouble with this kind of search: Ex. I've a text file: """ create or replace package XXX ... create or replace package body XXX ... """ now I want to search the position (line) of this two string. for the body I use: s = re.search(r'create\s+or\s+replace\s+package\s+body\s+', txt, re.IGNORECASE) but how to search for the other line ? I want the same RE but explicitly without "body". Thanks for your help. -- http://mail.python.org/mailman/listinfo/python-list
Re: Python regular expression question!
Sweet! Thanks so much! -- http://mail.python.org/mailman/listinfo/python-list
Re: Python regular expression question!
unexpected wrote:

> > \b matches the beginning/end of a word (characters a-zA-Z_0-9).
> > So that regex will match e.g. MULTX-FOO but not MULTX-.
> >
>
> So is there a way to get \b to include - ?

No, but you can get the behaviour you want using negative lookaheads. The following regex is effectively \b where - is treated as a word character:

    pattern = r"(?![a-zA-Z0-9_-])"

This effectively matches the next character that isn't in the group [a-zA-Z0-9_-] but doesn't consume it. For example:

    >>> p = re.compile(r".*?(?![a-zA-Z0-9_-])(.*)")
    >>> s = "aabbcc_d-f-.XXX YYY"
    >>> m = p.search(s)
    >>> print m.group(1)
    .XXX YYY

Note that the regex recognises the '.' as the end of the word, but doesn't use it up in the match, so it is present in the final capturing group. Contrast it with:

    >>> p = re.compile(r".*?[^a-zA-Z0-9_-](.*)")
    >>> s = "aabbcc_d-f-.XXX YYY"
    >>> m = p.search(s)
    >>> print m.group(1)
    XXX YYY

Note here that "[^a-zA-Z0-9_-]" still denotes the end of the word, but this time consumes it, so it doesn't appear in the final captured group.
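Applied to the original keyword problem, the lookahead can be combined with `re.escape`, which was suggested elsewhere in this thread. This is a sketch in Python 3 syntax; the helper name `starts_with_keyword` and the test lines are assumptions, not something posted in the thread.

```python
import re

WORD_CHARS = "a-zA-Z0-9_-"   # the usual word characters, with '-' added

def starts_with_keyword(keyword, line):
    """True if line starts with keyword and the keyword is not followed by a word character."""
    pattern = "^" + re.escape(keyword) + "(?![" + WORD_CHARS + "])"
    return re.search(pattern, line) is not None

print(starts_with_keyword("MULTX-", "MULTX- 5"))    # keyword followed by a space
print(starts_with_keyword("MULTX-", "MULTX-FOO"))   # keyword runs into another word
print(starts_with_keyword("MULTX", "MULTX-FOO"))    # '-' now counts as a word character
```

The lookahead also succeeds at the very end of the line, so a line consisting only of the keyword is accepted.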
Re: Python regular expression question!
> \b matches the beginning/end of a word (characters a-zA-Z_0-9). > So that regex will match e.g. MULTX-FOO but not MULTX-. > So is there a way to get \b to include - ? -- http://mail.python.org/mailman/listinfo/python-list
Re: Python regular expression question!
"unexpected" <[EMAIL PROTECTED]> writes: > I'm trying to do a whole word pattern match for the term 'MULTX-' > > Currently, my regular expression syntax is: > > re.search(('^')+(keyword+'\\b') \b matches the beginning/end of a word (characters a-zA-Z_0-9). So that regex will match e.g. MULTX-FOO but not MULTX-. Incidentally, in case the keyword contains regex special characters (like '*') you may wish to escape it: re.escape(keyword). -- Hallvard -- http://mail.python.org/mailman/listinfo/python-list
Python regular expression question!
I'm trying to do a whole word pattern match for the term 'MULTX-' Currently, my regular expression syntax is: re.search(('^')+(keyword+'\\b') where keyword comes from a list of terms. ('MULTX-' is in this list, and hence a keyword). My regular expression works for a variety of different keywords except for 'MULTX-'. It does work for MULTX, however, so I'm thinking that the '-' sign is delimited as a word boundary. Is there any way to get Python to override this word boundary? I've tried using raw strings, but the syntax is painful. My attempts were: re.search(('^')+("r"+keyword+'\b') re.search(('^')+("r'"+keyword+'\b') and then tried the even simpler: re.search(('^')+("r'"+keyword) re.search(('^')+("r''"+keyword) and all of those failed for everything. Any suggestions? -- http://mail.python.org/mailman/listinfo/python-list
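A short sketch of why these attempts behave as they do, in Python 3 syntax. It relies on two facts: `\b` matches only between a word character and a non-word character, and the `r` prefix is part of string-literal syntax, so concatenating the text "r" onto a string does not make anything raw. The test lines are assumptions chosen for illustration.

```python
import re

keyword = "MULTX-"

# In a normal string literal '\b' is a single backspace character;
# '\\b' and r'\b' both spell the two-character regex escape for a word boundary.
backspace_is_one_char = ("\b" == "\x08")
same_escape = ("\\b" == r"\b")

# Concatenating "r" just puts the letter r in front of the keyword.
not_raw = "r" + keyword

pattern = "^" + re.escape(keyword) + r"\b"
followed_by_word = re.search(pattern, "MULTX-FOO")    # boundary between '-' and 'F'
followed_by_space = re.search(pattern, "MULTX- FOO")  # no boundary between '-' and ' '
at_end = re.search(pattern, "MULTX-")                 # no boundary after a final '-'
```

So the pattern is not overriding a boundary at the '-'; after a trailing '-' there is simply no word boundary for `\b` to match unless a word character follows.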
Re: Regular Expression question
Steve, I thought Fredrik Lundh's proposal was perfect. Are you now saying it doesn't solve your problem because your description of the problem was incomplete? If so, could you post a worst case piece of htm, one that contains all possible complications, or a collection of different cases all of which you need to handle? Frederic - Original Message - From: <[EMAIL PROTECTED]> Newsgroups: comp.lang.python To: Sent: Monday, August 21, 2006 11:35 PM Subject: Re: Regular Expression question > Hi, thanks everyone for the information! Still going through it :) > > The reason I did not match on tag2 in my original expression (and I > apologize because I should have mentioned this before) is that other > tags could also have an attribute with the value of "adj__" and the > attribute name may not be the same for the other tags. The only thing I > can be sure of is that the value will begin with "adj__". > > I need to match the "adj__" value with the closest preceding tag1 > irrespective of what tag the "adj__" is in, or what the attribute > holding it is called, or the order of the attributes (there may be > others). This data will be inside an html page and so there will be > plenty of html tags in the middle all of which I need to ignore. > > Thanks very much! > Steve > > -- > http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
Hi, thanks everyone for the information! Still going through it :) The reason I did not match on tag2 in my original expression (and I apologize because I should have mentioned this before) is that other tags could also have an attribute with the value of "adj__" and the attribute name may not be the same for the other tags. The only thing I can be sure of is that the value will begin with "adj__". I need to match the "adj__" value with the closest preceding tag1 irrespective of what tag the "adj__" is in, or what the attribute holding it is called, or the order of the attributes (there may be others). This data will be inside an html page and so there will be plenty of html tags in the middle all of which I need to ignore. Thanks very much! Steve -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
[EMAIL PROTECTED] wrote: > got zero results on this one :) Really? >>> s = ''' ''' >>> pat = re.compile('tag1.+?name="(.+?)".*?(?:<)(?=tag2).*?="adj__(.*?)__', >>> re.DOTALL) >>> m = re.findall(pat, s) >>> m [('john', 'tall'), ('joe', 'short')] Regards, Rob -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
<[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
>
>
>
>
>
> Whenever a tag1 is followed by a tag 2, I want to retrieve the values
> of the tag1:name and tag2:value attributes. So my end result here
> should be
> john, tall
> jack, short
>

A pyparsing solution may not be a speed demon to run, but doesn't take too long to write. Some short explanatory comments:

- makeHTMLTags returns a tuple of opening and closing tags, but this example does not use any closing tags, so simpler to just discard them (only use zero'th return value)
- Your example includes not only <tag1> and <tag2> tags, but also a <br> tag, which is presumably ignorable.
- The value returned from calling the searchString generator includes named fields for the different tag attributes, making it easy to access the name and value tag attributes.
- The expression generated by makeHTMLTags will also handle tags with other surprising attributes that we didn't anticipate (such as "" or "")
- Pyparsing leaves the values as "adj__tall__" and "adj__short__", but some simple string slicing gets us the data we want

The pyparsing home page is at http://pyparsing.wikispaces.com.

-- Paul

    from pyparsing import makeHTMLTags

    tag1 = makeHTMLTags("tag1")[0]
    tag2 = makeHTMLTags("tag2")[0]
    br = makeHTMLTags("br")[0]

    # define the pattern we're looking for, in terms of tag1 and tag2
    # and specify that we wish to ignore <br> tags
    patt = tag1 + tag2
    patt.ignore(br)

    for tokens in patt.searchString(data):
        print "%s, %s" % (tokens.startTag1.name, tokens.startTag2.value[5:-2])

Prints:

    john, tall
    jack, short

Printing tokens.dump() gives:

    ['tag1', ['name', 'jack'], True, 'tag2', ['value', 'adj__short__'], True]
    - empty: True
    - name: jack
    - startTag1: ['tag1', ['name', 'jack'], True]
      - empty: True
      - name: jack
    - startTag2: ['tag2', ['value', 'adj__short__'], True]
      - empty: True
      - value: adj__short__
    - value: adj__short__
Re: Regular Expression question
On 2006-08-21, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
>
>
>
>
>
> Whenever a tag1 is followed by a tag 2, I want to retrieve the
> values of the tag1:name and tag2:value attributes. So my end
> result here should be
>
> john, tall
> jack, short
>
> Ideas?

It seems to me that an html parser might be a better solution.

Here's a slapped-together example. It uses a simple state machine.

    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.state = "get name"
            self.name_attrs = None
            self.result = {}

        def handle_starttag(self, tag, attrs):
            if self.state == "get name":
                if tag == "tag1":
                    self.name_attrs = attrs
                    self.state = "found name"
            elif self.state == "found name":
                if tag == "tag2":
                    name = None
                    for attr in self.name_attrs:
                        if attr[0] == "name":
                            name = attr[1]
                    adj = None
                    for attr in attrs:
                        if attr[0] == "value" and attr[1][:3] == "adj":
                            adj = attr[1][5:-2]
                    if name == None or adj == None:
                        print "Markup error: expected attributes missing."
                    else:
                        self.result[name] = adj
                    self.state = "get name"
                elif tag == "tag1":
                    # A new tag1 overrides the old one
                    self.name_attrs = attrs

    p = MyHTMLParser()
    p.feed(""" """)
    print repr(p.result)
    p.close()

There's probably a better way to search for attributes in attr than "for attr in attrs", but I didn't think of it, and the example I found on the net used the same idiom.

The format of attrs seems strange. Why isn't it a dictionary?

-- 
Neil Cerutti
Sermon Outline: I. Delineate your fear II. Disown your fear III. Displace your rear --Church Bulletin Blooper
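For readers on Python 3, the same state-machine idea can be sketched with `html.parser`. Since `attrs` arrives as a list of (name, value) tuples, `dict(attrs)` is one simple way to look attributes up by name. The class name `PairParser` and the sample markup are assumptions: the data posted in the thread did not survive in this archive, so the markup below is only a guess that is consistent with the results quoted in the thread.

```python
from html.parser import HTMLParser

class PairParser(HTMLParser):
    """Pair each tag2 with the closest preceding tag1."""

    def __init__(self):
        super().__init__()
        self.pending_name = None   # name from the most recent unmatched tag1
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)        # attrs is a list of (name, value) tuples
        if tag == "tag1":
            # A new tag1 overrides any earlier unmatched one.
            self.pending_name = attrs.get("name")
        elif tag == "tag2" and self.pending_name is not None:
            value = attrs.get("value") or ""
            if value.startswith("adj__"):
                # Strip the leading "adj__" and the trailing "__".
                self.pairs.append((self.pending_name, value[5:-2]))
            self.pending_name = None

# Hypothetical sample markup, not the poster's original data.
data = """
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/> <tag2 value="adj__short__"/>
"""

parser = PairParser()
parser.feed(data)
parser.close()
print(parser.pairs)
```

Other tags such as `<br/>` fall through both branches and are ignored, which is what lets unrelated markup sit between a tag1 and its tag2.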
Re: Regular Expression question
[EMAIL PROTECTED] wrote: > Hi, I am having some difficulty trying to create a regular expression. Steve, I find this tool is great for debugging regular expressions. http://kodos.sourceforge.net/ Just put some sample text in one window, your trial RE in another, and Kodos displays a wealth of information on what matches. Try it. - Paddy. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
[EMAIL PROTECTED] wrote:
> Hi, I am having some difficulty trying to create a regular expression.
>
> Consider:
>
>
>
>
>
>
> Whenever a tag1 is followed by a tag 2, I want to retrieve the values
> of the tag1:name and tag2:value attributes. So my end result here
> should be
> john, tall
> jack, short

    import re

    data = """ """

    elems = re.findall("<(tag1|tag2)\s+(\w+)=\"([^\"]*)\"/>", data)

    for i in range(len(elems)-1):
        if elems[i][0] == "tag1" and elems[i+1][0] == "tag2":
            print elems[i][2], elems[i+1][2]
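The same find-then-pair idea as a runnable sketch in Python 3 syntax. The sample markup is an assumption (the data in the original post was lost from this archive), and the slicing that strips "adj__" and the trailing "__" is added here for illustration.

```python
import re

# Hypothetical sample markup, not the poster's original data.
data = '''
<tag1 name="john"/> <br/> <tag2 value="adj__tall__"/>
<tag1 name="joe"/>
<tag1 name="jack"/> <tag2 value="adj__short__"/>
'''

# Step 1: collect every tag1/tag2 in document order; other tags are skipped.
elems = re.findall(r'<(tag1|tag2)\s+(\w+)="([^"]*)"/>', data)

# Step 2: keep only the adjacent (tag1, tag2) pairs.
pairs = [
    (first[2], second[2][len("adj__"):-len("__")])
    for first, second in zip(elems, elems[1:])
    if first[0] == "tag1" and second[0] == "tag2"
]
print(pairs)
```

Splitting the work into two steps avoids asking a single regex to decide which tag1 is the closest one, which is where the one-pattern attempts in this thread ran into trouble.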
Re: Regular Expression question
got zero results on this one :) -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
[EMAIL PROTECTED] wrote: > Thanks, i just tried it but I got the same result. > > I've been thinking about it for a few hours now and the problem with > this approach is that the .*? before the (?=tag2) may have matched a > tag1 and i don't know how to detect it. Maybe like this: 'tag1.+?name="(.+?)".*?(?:<)(?=tag2).*?="adj__(.*?)__' HTH, Rob -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
I am not an expert on REs yet; this is my first possible solution:

    import re

    txt = """ """

    tfinder = r"""<                # The opening < the tag to find
                  \s*              # Possible space or newline
                  (tag[12])        # First subgroup, the identifier, tag1 or tag2
                  \s+              # There must be a space or newline or more
                  (?:name|value)   # Name or value, non-grouping
                  \s*              # Possible space or newline
                  =                # The =
                  \s*              # Possible space or newline
                  "                # Opening "
                  ([^"]*)          # Second subgroup, the tag string, it can't contain "
                  "                # Closing " of the string
                  \s*              # Possible space or newline
                  /?               # One optional ending /
                  \s*              # Possible space or newline
                  >                # The closing > of the tag
                  ?                # Greedy, match the first closing >
               """
    patt = re.compile(tfinder, flags=re.I+re.X)

    prec_type = ""
    prec_string = ""
    for mobj in patt.finditer(txt):
        curr_type, curr_string = mobj.groups()
        if curr_type == "tag2" and prec_type == "tag1":
            print prec_string, curr_string.replace("adj__", "").strip("_")
        prec_type = curr_type
        prec_string = curr_string

Bye,
bearophile
Re: Regular Expression question
Thanks, i just tried it but I got the same result. I've been thinking about it for a few hours now and the problem with this approach is that the .*? before the (?=tag2) may have matched a tag1 and i don't know how to detect it. And even if I could, how would I make the search reset its start position to the second tag1 it found? -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
[EMAIL PROTECTED] wrote: > Hi, I am having some difficulty trying to create a regular expression. > > Consider: > > > > > > > Whenever a tag1 is followed by a tag 2, I want to retrieve the values > of the tag1:name and tag2:value attributes. So my end result here > should be > john, tall > jack, short > > My low quality regexp > re.compile('tag1.+?name="(.+?)".*?(?!tag1).*?="adj__(.*?)__', > re.DOTALL) > > cannot handle the case where there is a tag1 that is not followed by a > tag2. findall returns > john, tall > joe, short > > Ideas? Have you tried this: 'tag1.+?name="(.+?)".*?(?=tag2).*?="adj__(.*?)__' ? HTH, Rob -- http://mail.python.org/mailman/listinfo/python-list
Regular Expression question
Hi, I am having some difficulty trying to create a regular expression. Consider: Whenever a tag1 is followed by a tag 2, I want to retrieve the values of the tag1:name and tag2:value attributes. So my end result here should be john, tall jack, short My low quality regexp re.compile('tag1.+?name="(.+?)".*?(?!tag1).*?="adj__(.*?)__', re.DOTALL) cannot handle the case where there is a tag1 that is not followed by a tag2. findall returns john, tall joe, short Ideas? Thanks. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
Paul McGuire wrote: >> import re >> r=re.compile('[^"]+)"[^>]*>',re.IGNORECASE) >> for m in r.finditer(html): >> print m.group('image') >> > > Ouch - this fails to match any tag that has some other > attribute, such as "height" or "width", before the "src" attribute. > www.yahoo.com has several such tags. It also fails to match any image tag where the src attribute is quoted using single quotes, or where the src attribute is not enclosed in quotes at all. Handle all of that correctly in the regex and the beautiful soup or pyparsing options look even more attractive. In fact, if anyone can write a regex which matches the source attribute in a single named group, and correctly handles double, single and unquoted attributes, I'll admit to being impressed (and probably also slightly queasy when looking at it). Here's my best attempt at a regex that gets it right, but it still gets confused by other attributes if they contain spaces. >>> ATTR = '''[^\s=>]+(?:=(?:"[^">]*"|'[^'>]*'|[^"'\s>][^\s>]*))?''' >>> NOTSRC = '(?!src=)' + ATTR >>> PAT = '''(?<=")[^">]*|(?<=')[^'>]*|[^ >]*)''' >>> htmlPage = ''' ''' >>> for m in r.finditer(htmlPage): print m.group('image') fred.jpg freda.jpg >>> -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
"Frank Potter" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > pyparsing is cool. > but use only re is also OK > # -*- coding: UTF-8 -*- > import urllib2 > html=urllib2.urlopen(ur"http://www.yahoo.com/";).read() > > import re > r=re.compile('[^"]+)"[^>]*>',re.IGNORECASE) > for m in r.finditer(html): > print m.group('image') > Ouch - this fails to match any tag that has some other attribute, such as "height" or "width", before the "src" attribute. www.yahoo.com has several such tags. On the other hand, pyparsing's makeHTMLTags defines a starting tag expression that looks for (conceptually): < tagname ZeroOrMore(attrname '=' value) Optional('/') > and does not assume that the first tag is "src", or anything else for that matter. The returned results make the tag attributes accessible as object attributes or dictionary keys. -- Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
pyparsing is cool, but using only re is also OK:

    # -*- coding: UTF-8 -*-
    import urllib2
    html=urllib2.urlopen(ur"http://www.yahoo.com/").read()

    import re
    r=re.compile('[^"]+)"[^>]*>',re.IGNORECASE)
    for m in r.finditer(html):
        print m.group('image')

I got these results:

    http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
    http://us.i1.yimg.com/us.yimg.com/i/ww/beta/wthr.gif
    http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif

On 6/8/06, Paul McGuire <[EMAIL PROTECTED]> wrote:
> <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
> > Hi,
> > I am new to python regular expression, I would like to use it to get an
> > attribute of an html element from an html file?
> >
> > for example, I was able to read the html file using this:
> >    req = urllib2.Request(url=acaURL)
> >    f = urllib2.urlopen(req)
> >
> >    data = f.read()
> >
> > my question is how can I just get the src attribute value of an img
> > tag?
> > something like this:
> > (.*)(.*)
> >
> > I need to get the href of the image source.
> >
> > Thanks.
> >
>
> As Fredrik pointed out, re's are not the only tool out there. Here's a
> pyparsing solution.
>
> -- Paul
>
>
> import pyparsing
> import urllib
>
> # define HTML tag format using makeHTMLTags helper
> # (we don't really care about the ending tag,
> # even though makeHTMLTags returns definitions for both
> # starting and ending tag patterns)
> imgStartTag, dummy = pyparsing.makeHTMLTags("img")
>
> # get HTML source from some web site
> htmlPage = urllib.urlopen("http://www.yahoo.com")
> htmlSource = htmlPage.read()
> htmlPage.close()
>
> # scan HTML source, printing SRC attribute from each <img> tag
> for tokens,start,end in imgStartTag.scanString(htmlSource):
>     print tokens.src
>
>
> Prints:
>
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
> http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/hea_0411.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/img_0607.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/beta/news/video.gif
> http://us.i1.yimg.com/us.yimg.com/i/buzz/2006/06/wholefoodssmall.jpg
> http://us.i1.yimg.com/us.yimg.com/i/mntl/msg/06q2/img_im.jpg
> http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
> http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif
Re: Regular Expression question
<[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Hi,
> I am new to python regular expression, I would like to use it to get an
> attribute of an html element from an html file?
>
> for example, I was able to read the html file using this:
>    req = urllib2.Request(url=acaURL)
>    f = urllib2.urlopen(req)
>
>    data = f.read()
>
> my question is how can I just get the src attribute value of an img
> tag?
> something like this:
> (.*)(.*)
>
> I need to get the href of the image source.
>
> Thanks.
>

As Fredrik pointed out, re's are not the only tool out there. Here's a pyparsing solution.

-- Paul

    import pyparsing
    import urllib

    # define HTML tag format using makeHTMLTags helper
    # (we don't really care about the ending tag,
    # even though makeHTMLTags returns definitions for both
    # starting and ending tag patterns)
    imgStartTag, dummy = pyparsing.makeHTMLTags("img")

    # get HTML source from some web site
    htmlPage = urllib.urlopen("http://www.yahoo.com")
    htmlSource = htmlPage.read()
    htmlPage.close()

    # scan HTML source, printing SRC attribute from each <img> tag
    for tokens,start,end in imgStartTag.scanString(htmlSource):
        print tokens.src

Prints:

    http://us.i1.yimg.com/us.yimg.com/i/ww/beta/edit_plink.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/125.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/13441.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/136.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/v5_mail_t2.gif
    http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/hea_0411.gif
    http://us.i1.yimg.com/us.yimg.com/i/mntl/aut/06q2/img_0607.jpg
    http://us.i1.yimg.com/us.yimg.com/i/ww/news/2006/06/07/0607notorious_big.jpg
    http://us.i1.yimg.com/us.yimg.com/i/ww/beta/news/video.gif
    http://us.i1.yimg.com/us.yimg.com/i/buzz/2006/06/wholefoodssmall.jpg
    http://us.i1.yimg.com/us.yimg.com/i/mntl/msg/06q2/img_im.jpg
    http://us.i1.yimg.com/us.yimg.com/i/ww/trfc_bckt.gif
    http://us.i1.yimg.com/us.yimg.com/i/mntl/sh/04q2/camera.gif
Re: Re: Regular Expression question
I'm sorry! What I mean is that the pattern is an argument of a function. In that case, how do I handle special characters?

    pattern = 'www.' # not this
    if re.compile(pattern).match(string) is not None:
        ..

but not:

    if re.compile(r'www.').match(string) is not None:

or

    if re.compile('www\.').match(string) is not None:

That is, how do you handle special characters, like the dot?

Fredrik Lundh <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > I am new to python regular expression, I would like to use it to get an
> > attribute of an html element from an html file?
>
> if you want to parse HTML, use an HTML parser. if you want to parse sloppy
> HTML, use a tolerant HTML parser:
>
> http://www.crummy.com/software/BeautifulSoup/
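This question was not answered in the archived thread. One common approach, sketched here in Python 3 syntax, is to pass the incoming string through `re.escape` so that characters such as the dot are matched literally. The helper name `starts_with` and the sample strings are assumptions for illustration.

```python
import re

def starts_with(literal, string):
    """True if string begins with literal, with every character taken literally."""
    return re.compile(re.escape(literal)).match(string) is not None

# Unescaped, the dot matches any character, so 'wwwx...' slips through.
unescaped_hit = re.compile("www.").match("wwwxpython") is not None

print(starts_with("www.", "www.python.org"))
print(starts_with("www.", "wwwxpython"))
print(unescaped_hit)
```

Escaping at the point where the pattern is compiled means callers can pass plain text without knowing any regex syntax.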
Re: Regular Expression question
[EMAIL PROTECTED] wrote: > I am new to python regular expression, I would like to use it to get an > attribute of an html element from an html file? if you want to parse HTML, use an HTML parser. if you want to parse sloppy HTML, use a tolerant HTML parser: http://www.crummy.com/software/BeautifulSoup/ -- http://mail.python.org/mailman/listinfo/python-list
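As one illustration of the "use an HTML parser" advice, here is a rough sketch built on the standard library's html.parser module (Python 3 syntax). The class name ImgSrcCollector and the function img_sources are invented for this example; for sloppy real-world pages the BeautifulSoup package mentioned above is likely to be more forgiving.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of pairs.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)

def img_sources(html_text):
    """Return the src values of all img tags in html_text, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources
```

The text read from the URL would be passed to img_sources() after decoding it to str.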
Regular Expression question
Hi, I am new to python regular expression, I would like to use it to get an attribute of an html element from an html file? for example, I was able to read the html file using this: req = urllib2.Request(url=acaURL) f = urllib2.urlopen(req) data = f.read() my question is how can I just get the src attribute value of an img tag? something like this: (.*)(.*) I need to get the href of the image source. Thanks. -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular Expression question
Michelle McCall wrote: >I have a script that needs to scan every line of a file for numerous > strings. There are groups of strings for each "area" of data we are looking > for. Looping through each of these list of strings separately for each line > has slowed execution to a crawl. Can I create ONE regular expression from a > group of strings such that when I perform a search on a line from the file > with this RE it will search the line for each one of the strings in the RE ? does m = re.search("spam|egg|bacon", line) do what you want? if you need all matches, you can use for m in re.finditer("spam|egg|bacon", line): ... if the strings are all literal strings (i.e. no subpatterns), a little preparation might speed things up: words = ["spam", "spim", "spum", "spamwall", "wallspam"] words.sort() # lexical order words.reverse() # look for longest match first pattern = "|".join(map(re.escape, words)) pattern = re.compile(pattern) for m in pattern.finditer(line): ... -- http://mail.python.org/mailman/listinfo/python-list
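If it matters which "area" a hit belongs to, the same alternation idea can be extended with one named group per area. This is only a sketch in Python 3 syntax under assumed data: the area names and word lists are invented, and build_area_pattern and scan_line are hypothetical helpers. Words are sorted longest first within an area, a variation on the reversed lexical sort above, so that "spamwall" is tried before "spam".

```python
import re

# Hypothetical areas; each key must be a valid group name (an identifier).
AREAS = {
    "breakfast": ["spam", "spamwall", "egg", "bacon"],
    "dairy": ["milk", "cheese"],
}

def build_area_pattern(areas):
    """Compile one regex with a named group per area."""
    parts = []
    for name, words in areas.items():
        ordered = sorted(words, key=len, reverse=True)  # longest match first
        alternation = "|".join(re.escape(word) for word in ordered)
        parts.append("(?P<%s>%s)" % (name, alternation))
    return re.compile("|".join(parts))

def scan_line(pattern, line):
    """Return (area, matched_text) pairs for every hit in line."""
    return [(m.lastgroup, m.group()) for m in pattern.finditer(line)]
```

The pattern is compiled once and reused for every line. Note that if the same prefix appears in two areas, the area listed first wins.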
Regular Expression question
I have a script that needs to scan every line of a file for numerous strings. There are groups of strings for each "area" of data we are looking for. Looping through each of these lists of strings separately for each line has slowed execution to a crawl. Can I create ONE regular expression from a group of strings such that when I perform a search on a line from the file with this RE it will search the line for each one of the strings in the RE? Michelle -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular expression question -- exclude substring
On Mon, 7 Nov 2005 16:38:11 -0800, James Stroud <[EMAIL PROTECTED]> wrote: >On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote: >> Ya, for some reason your non-greedy "?" doesn't seem to be taking. >> This works: >> >> re.sub('(.*)(00.*?01) target_mark', r'\2', your_string) > >The non-greedy is actually acting as expected. This is because non-greedy >operators are "forward looking", not "backward looking". So the non-greedy >finds the start of the first start-of-the-match it comes accross and then >finds the first occurrence of '01' that makes the complete match, otherwise >the greedy operator would match .* as much as it could, gobbling up all '01's >before the last because these match '.*'. For example: > >py> rgx = re.compile(r"(00.*01) target_mark") >py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01') >['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01'] >py> rgx = re.compile(r"(00.*?01) target_mark") >py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01') >['00 noise1 01 noise2 00 target 01', '00 dowhat 01'] > >My understanding is that backward looking operators are very resource >expensive to implement. > If the delimiting strings are fixed, we can use plain python string methods, e.g., (not tested beyond what you see ;-) >>> s = "00 noise1 01 noise2 00 target 01 target_mark" >>> def findit(s, beg='00', end='01', tmk=' target_mark'): ... start = 0 ... while True: ... t = s.find(tmk, start) ... if t<0: break ... start = s.rfind(beg, start, t) ... if start<0: break ... e = s.find(end, start, t) ... if e+len(end)==t: # _just_ after ... yield s[start:e+len(end)] ... start = t+len(tmk) ... 
>>> list(findit(s))
['00 target 01']
>>> s2 = s + ' garbage noise3 00 almost 01 target_mark 00 success 01 target_mark'
>>> list(findit(s2))
['00 target 01', '00 success 01']

(I didn't enforce exact adjacency the first time; obviously it would be more efficient to search for end+tmk instead of tmk and back to beg and forward to end ;-)

If there can be spurious target_marks, and tricky matching spans, additional logic may be needed. Too lazy to think about it ;-)

Regards,
Bengt Richter
-- http://mail.python.org/mailman/listinfo/python-list
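The "search for end+tmk" variant mentioned in the closing remark could look roughly like the following. It is a sketch in Python 3 syntax, not Bengt's code: the function name find_before_mark is made up, and unlike findit() it does not check whether another end marker sits between the chosen beg and the mark.

```python
def find_before_mark(s, beg='00', end='01', tmk=' target_mark'):
    """Yield each beg...end span that is immediately followed by tmk."""
    tail = end + tmk              # searching for end+tmk enforces adjacency
    pos = 0
    while True:
        t = s.find(tail, pos)     # next end marker directly before a mark
        if t < 0:
            break
        start = s.rfind(beg, pos, t)   # nearest beg to the left of it
        if start >= 0:
            yield s[start:t + len(end)]
        pos = t + len(tail)
```

For the string used in this thread, list(find_before_mark(s)) gives the single span '00 target 01'.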
Re: Regular expression question -- exclude substring
On Monday 07 November 2005 17:31, Kent Johnson wrote: > James Stroud wrote: > > On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote: > >>Ya, for some reason your non-greedy "?" doesn't seem to be taking. > >>This works: > >> > >>re.sub('(.*)(00.*?01) target_mark', r'\2', your_string) > > > > The non-greedy is actually acting as expected. This is because non-greedy > > operators are "forward looking", not "backward looking". So the > > non-greedy finds the start of the first start-of-the-match it comes > > accross and then finds the first occurrence of '01' that makes the > > complete match, otherwise the greedy operator would match .* as much as > > it could, gobbling up all '01's before the last because these match '.*'. > > For example: > > > > py> rgx = re.compile(r"(00.*01) target_mark") > > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat > > 01') ['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01'] > > py> rgx = re.compile(r"(00.*?01) target_mark") > > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat > > 01') ['00 noise1 01 noise2 00 target 01', '00 dowhat 01'] > > ??? not in my Python: > >>> rgx = re.compile(r"(00.*01) target_mark") > >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat > >>> 01') > > ['00 noise1 01 noise2 00 target 01'] > > >>> rgx = re.compile(r"(00.*?01) target_mark") > >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat > >>> 01') > > ['00 noise1 01 noise2 00 target 01'] > > Since target_mark only occurs once in the string the greedy and non-greedy > match is the same in this case. Somehow my cutting and pasting got messed up. 
It should be: py> rgx = re.compile(r"(00.*?01) target_mark") py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark') ['00 noise1 01 noise2 00 target 01', '00 dowhat 01'] py> rgx = re.compile(r"(00.*01) target_mark") py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark') ['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01'] Sorry about that. James -- James Stroud UCLA-DOE Institute for Genomics and Proteomics Box 951570 Los Angeles, CA 90095 http://www.jamesstroud.com/ -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular expression question -- exclude substring
James Stroud wrote: > On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote: > >>Ya, for some reason your non-greedy "?" doesn't seem to be taking. >>This works: >> >>re.sub('(.*)(00.*?01) target_mark', r'\2', your_string) > > > The non-greedy is actually acting as expected. This is because non-greedy > operators are "forward looking", not "backward looking". So the non-greedy > finds the start of the first start-of-the-match it comes accross and then > finds the first occurrence of '01' that makes the complete match, otherwise > the greedy operator would match .* as much as it could, gobbling up all '01's > before the last because these match '.*'. For example: > > py> rgx = re.compile(r"(00.*01) target_mark") > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01') > ['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01'] > py> rgx = re.compile(r"(00.*?01) target_mark") > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01') > ['00 noise1 01 noise2 00 target 01', '00 dowhat 01'] ??? not in my Python: >>> rgx = re.compile(r"(00.*01) target_mark") >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01') ['00 noise1 01 noise2 00 target 01'] >>> rgx = re.compile(r"(00.*?01) target_mark") >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01') ['00 noise1 01 noise2 00 target 01'] Since target_mark only occurs once in the string the greedy and non-greedy match is the same in this case. Kent -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular expression question -- exclude substring
On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
> Ya, for some reason your non-greedy "?" doesn't seem to be taking.
> This works:
>
> re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)

The non-greedy is actually acting as expected. This is because non-greedy operators are "forward looking", not "backward looking". So the non-greedy finds the start of the first start-of-the-match it comes across and then finds the first occurrence of '01' that makes the complete match, otherwise the greedy operator would match .* as much as it could, gobbling up all '01's before the last because these match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

My understanding is that backward looking operators are very resource-expensive to implement.

James
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/
-- http://mail.python.org/mailman/listinfo/python-list
Re: Regular expression question -- exclude substring
Ya, for some reason your non-greedy "?" doesn't seem to be taking. This works: re.sub('(.*)(00.*?01) target_mark', r'\2', your_string) -- http://mail.python.org/mailman/listinfo/python-list
Re: Regular expression question -- exclude substring
[EMAIL PROTECTED] wrote: > Hi, > > I'm having trouble extracting substrings using regular expression. Here > is my problem: > > Want to find the substring that is immediately before a given > substring. For example: from > "00 noise1 01 noise2 00 target 01 target_mark", > want to get > "00 target 01" > which is before > "target_mark". > My regular expression > "(00.*?01) target_mark" > will extract > "00 noise1 01 noise2 00 target 01". If there is a character that can't appear in the bit between the numbers then use everything-but-that instead of . - for example if spaces can only appear as you show them, use "(00 [^ ]* 01) target_mark" or "(00 \S* 01) target_mark" Kent -- http://mail.python.org/mailman/listinfo/python-list
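When no single character can be ruled out between the delimiters, another way to get "everything except a substring" is a negative lookahead applied at each step, so that the span cannot run across a second "00". The sketch below (Python 3 syntax) is an alternative to the character-class patterns above, not something proposed in the thread; the names TEMPERED and spans_before_mark are made up.

```python
import re

# After the opening 00, consume one character at a time, but only where
# another 00 does NOT start; stop at the first 01 followed by the mark.
TEMPERED = re.compile(r"(00(?:(?!00).)*?01) target_mark")

def spans_before_mark(text):
    """Return every 00...01 span that directly precedes ' target_mark'."""
    return TEMPERED.findall(text)
```

Because the lookahead is tested at every position, this is slower than a simple character class, so the `\S*` form is preferable whenever the data allows it.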
Regular expression question -- exclude substring
Hi, I'm having trouble extracting substrings using regular expression. Here is my problem: Want to find the substring that is immediately before a given substring. For example: from "00 noise1 01 noise2 00 target 01 target_mark", want to get "00 target 01" which is before "target_mark". My regular expression "(00.*?01) target_mark" will extract "00 noise1 01 noise2 00 target 01". I'm thinking that the solution to my problem might be to use a regular expression to exclude the substring "target_mark", which will replace the part of ".*" above. However, I don't know how to exclude a substring. Can anyone help on this? Or maybe give another solution to my problem? Thanks very much. -- http://mail.python.org/mailman/listinfo/python-list
Re: Hopefully simple regular expression question
Thank you! I had totally forgot about that. It works. -- http://mail.python.org/mailman/listinfo/python-list
Re: Hopefully simple regular expression question
On 14 Jun 2005 04:01:58 -0700, rumours say that "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> might have written:

>I want to match a word against a string such that 'peter' is found in
>"peter bengtsson" or " hey peter," but not in "thepeter bengtsson" or
>"hey peterbe," because the word has to stand on its own. The following
>code works for a single word:

[snip]

use \b before and after the word you search for, for example:

rePeter = re.compile(r"\bpeter\b", re.I)

(The pattern has to be a raw string: in an ordinary string literal, "\b" is a backspace character.)

In the documentation for the re module, Subsection 4.2.1 is Regular Expression Syntax; it'll help a lot if you read it.

Cheers.
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
-- http://mail.python.org/mailman/listinfo/python-list
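One thing to watch with \b is how the pattern string is written: in an ordinary Python string literal "\b" is the backspace character, so the word-boundary form needs a raw string. A small check in Python 3 syntax; the variable names are for illustration only:

```python
import re

plain = "\bpeter\b"    # two backspace characters around 'peter'
raw = r"\bpeter\b"     # backslash + 'b': the regex word-boundary assertion

first_char_is_backspace = plain[0] == "\x08"
plain_hit = re.search(plain, "hey peter,", re.I)      # looks for backspaces
raw_hit = re.search(raw, "hey Peter,", re.I)          # matches the word
inside_word = re.search(raw, "hey peterbe,", re.I)    # no boundary after 'peter'
```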
Re: Hopefully simple regular expression question
On Tue, 14 Jun 2005 13:01:58 +0200, [EMAIL PROTECTED] wrote (in article <[EMAIL PROTECTED]>): > How do I modify my regular expression to match on expressions as well > as just single words?? import re def createStandaloneWordRegex(word): """ return a regular expression that can find 'peter' only if it's written alone (next to space, start of string, end of string, comma, etc) but not if inside another word like peterbe """ return re.compile(r'\b' + word + r'\b', re.I) def test_createStandaloneWordRegex(): def T(word, text): print createStandaloneWordRegex(word).findall(text) T("peter", "So Peter Bengtsson wrote this") T("peter", "peter") T("peter bengtsson", "So Peter Bengtsson wrote this") test_createStandaloneWordRegex() Works? -- http://mail.python.org/mailman/listinfo/python-list
Re: Hopefully simple regular expression question
[EMAIL PROTECTED] wrote: > I want to match a word against a string such that 'peter' is found in > "peter bengtsson" or " hey peter," or but in "thepeter bengtsson" or > "hey peterbe," because the word has to stand on its own. The following > code works for a single word: > > def createStandaloneWordRegex(word): > """ return a regular expression that can find 'peter' only if it's > written > alone (next to space, start of string, end of string, comma, etc) > but > not if inside another word like peterbe """ > return re.compile(r""" > ( > ^ %s > (?=\W | $) > | > (?<=\W) > %s > (?=\W | $) > ) > """% (word, word), re.I|re.L|re.M|re.X) > > > def test_createStandaloneWordRegex(): > def T(word, text): > print createStandaloneWordRegex(word).findall(text) > > T("peter", "So Peter Bengtsson wrote this") > T("peter", "peter") > T("peter bengtsson", "So Peter Bengtsson wrote this") > > The result of running this is:: > > ['Peter'] > ['peter'] > [] <--- this is the problem!! > > > It works if the parameter is just one word (eg. 'peter') but stops > working when it's an expression (eg. 'peter bengtsson') No, not when it's an "expression" (whatever that means), but when the parameter contains whitespace, which is ignored in verbose mode. > > How do I modify my regular expression to match on expressions as well > as just single words?? > If you must stick with re.X, you must escape any whitespace characters in your "word" -- see re.escape(). Alternatively (1), drop re.X but this is ugly: regex_text_no_X = r"(^%s(?=\W|$)|(?<=\W)%s(?=\W|$))" % (word, word) Alternatively (2), consider using the \b gadget; this appears to give the same answers as the baroque method: regex_text_no_flab = r"\b%s\b" % word HTH, John -- http://mail.python.org/mailman/listinfo/python-list
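Combining the two suggestions above, re.escape() and \b, gives something like the sketch below (Python 3 syntax). The function names are invented for illustration. It assumes the phrase begins and ends with a word character; otherwise \b will not behave as a "stands on its own" test.

```python
import re

def standalone_regex(phrase):
    """Match phrase only where it is not glued to other word characters."""
    return re.compile(r"\b" + re.escape(phrase) + r"\b", re.I)

def standalone_regex_verbose(phrase):
    """Same idea with re.X kept; escaping preserves the space in the phrase."""
    return re.compile(r"\b %s \b" % re.escape(phrase), re.I | re.X)
```

With these, standalone_regex("peter bengtsson").findall("So Peter Bengtsson wrote this") finds the two-word phrase, which was the failing case in the original post.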
Hopefully simple regular expression question
I want to match a word against a string such that 'peter' is found in "peter bengtsson" or " hey peter," but not in "thepeter bengtsson" or "hey peterbe," because the word has to stand on its own. The following code works for a single word:

def createStandaloneWordRegex(word):
    """ return a regular expression that can find 'peter' only if it's
    written alone (next to space, start of string, end of string, comma, etc)
    but not if inside another word like peterbe """
    return re.compile(r"""
      (
      ^ %s
      (?=\W | $)
      |
      (?<=\W)
      %s
      (?=\W | $)
      )
      """% (word, word), re.I|re.L|re.M|re.X)


def test_createStandaloneWordRegex():
    def T(word, text):
        print createStandaloneWordRegex(word).findall(text)

    T("peter", "So Peter Bengtsson wrote this")
    T("peter", "peter")
    T("peter bengtsson", "So Peter Bengtsson wrote this")

The result of running this is::

    ['Peter']
    ['peter']
    []          <--- this is the problem!!

It works if the parameter is just one word (eg. 'peter') but stops working when it's an expression (eg. 'peter bengtsson')

How do I modify my regular expression to match on expressions as well as just single words??
-- http://mail.python.org/mailman/listinfo/python-list
Re: regular expression question
Bruno Desthuilliers wrote: >> match = STX + '(.*)' + ETX >> >> # Example 1 >> # This appears to work, but I'm not sure if the '+' is being used in >> the regular expression, or if it's just joining STX, '(.*)', and ETX. >> >> if re.search(STX + '(.*)' + ETX,data): >> print "Matches" >> >> # Example 2 >> # This also appears to work >> if re.search(match,data): >> print "Matches" > You may want something like: > if re.search('%s(.*)%s' % (STX, ETX), data): > ... that's of course the same thing as examples 1 and 2. a tip to the original poster: if you're not sure what an expression does, try printing the result. use "print repr(v)" if the value may contain odd characters. try adding this to your test script: print repr(match) print repr(STX + '(.*)' + ETX) print repr('%s(.*)%s' % (STX, ETX)) -- http://mail.python.org/mailman/listinfo/python-list
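To make the point concrete: the '+' only joins Python strings before the re module ever sees them, so all three spellings hand the same pattern to re.search(). A minimal sketch in Python 3 syntax, with the control characters written out as literals instead of imported from curses.ascii; that substitution, and the use of re.escape(), are this example's assumptions, not part of the thread.

```python
import re

STX, ETX, FS = "\x02", "\x03", "\x1c"   # start of text, end of text, field separator
data = STX + "ONE" + FS + "TWO" + FS + "THREE" + ETX

# Three ways of building the same pattern string.
by_concat = STX + "(.*)" + ETX
by_format = "%s(.*)%s" % (STX, ETX)
by_escape = re.escape(STX) + "(.*)" + re.escape(ETX)  # safe for any delimiter

match = re.search(by_escape, data)
fields = match.group(1).split(FS)

# The literal letters S, T, X are a different pattern and do not match.
literal_hit = re.search("STX(.*)ETX", data)
```

re.escape() makes no difference for these particular control characters, but it would matter if a delimiter were something like '.' or '['. Also note that '.' does not match a newline unless re.DOTALL is given.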
Re: regular expression question
> You may want something like: > if re.search('%s(.*)%s' % (STX, ETX), data): > Ah I didn't even think about that... Chris -- http://mail.python.org/mailman/listinfo/python-list
Re: regular expression question
snacktime a écrit : The primary question is how do I perform a match when the regular expression contains string variables? For example, in the following code I want to match a line that starts with STX, then has any number of characters, then ends with STX. Example 2 I'm pretty sure works as I expect, but I'm not sure about Example 1, and I'm pretty sure about example 3. import re from curses.ascii import STX,ETX,FS STX = chr(STX) ETX = chr(ETX) FS = chr(FS) data = STX + "ONE" + FS + "TWO" + FS + "THREE" + ETX match = STX + '(.*)' + ETX # Example 1 # This appears to work, but I'm not sure if the '+' is being used in the regular expression, or if it's just joining STX, '(.*)', and ETX. if re.search(STX + '(.*)' + ETX,data): print "Matches" # Example 2 # This also appears to work if re.search(match,data): print "Matches" # Example 3 # Doesn't work, as STX and ETX are evaluated as the literal strings 'STX' and 'ETX' if re.search('STX(.*)ETX', data): print "Matches" You may want something like: if re.search('%s(.*)%s' % (STX, ETX), data): ... BTW, given your requirements, I'd write this: if re.search('^%s(.*)%s$' % (STX, ETX), data): ... Chris -- http://mail.python.org/mailman/listinfo/python-list
regular expression question
The primary question is how do I perform a match when the regular expression contains string variables? For example, in the following code I want to match a line that starts with STX, then has any number of characters, then ends with ETX. Example 2 I'm pretty sure works as I expect, but I'm not sure about Example 1, and I'm pretty sure Example 3 doesn't work.

import re
from curses.ascii import STX,ETX,FS

STX = chr(STX)
ETX = chr(ETX)
FS = chr(FS)

data = STX + "ONE" + FS + "TWO" + FS + "THREE" + ETX
match = STX + '(.*)' + ETX

# Example 1
# This appears to work, but I'm not sure if the '+' is being used in the
# regular expression, or if it's just joining STX, '(.*)', and ETX.
if re.search(STX + '(.*)' + ETX,data):
    print "Matches"

# Example 2
# This also appears to work
if re.search(match,data):
    print "Matches"

# Example 3
# Doesn't work, as STX and ETX are evaluated as the literal strings 'STX' and 'ETX'
if re.search('STX(.*)ETX', data):
    print "Matches"

Chris
-- http://mail.python.org/mailman/listinfo/python-list
Re: Simple (newbie) regular expression question
John Machin wrote: André Roberge wrote: Sorry for the simple question, but I find regular expressions rather intimidating. And I've never needed them before ... How would I go about to 'define' a regular expression that would identify strings like __alphanumerical__ as in __init__ (Just to spell things out, as I have seen underscores disappear from messages before, that's 2 underscores immediately followed by an alphanumerical string immediately followed by 2 underscore; in other words, a python 'private' method). Simple one-liner would be good. One-liner with explanation would be better. One-liner with explanation, and pointer to 'great tutorial' (for future reference) would probably be ideal. (I know, google is my friend for that last part. :-) Andre Firstly, some corrections: (1) google is your friend for _all_ parts of your question (2) Python has an initial P and doesn't have private methods. Read this: pat1 = r'__[A-Za-z0-9_]*__' pat2 = r'__\w*__' import re tests = ['x', '__', '____', '_____', '__!__', '__a__', '__Z__', '__8__', '__xyzzy__', '__plugh'] [x for x in tests if re.search(pat1, x)] ['____', '_____', '__a__', '__Z__', '__8__', '__xyzzy__'] [x for x in tests if re.search(pat2, x)] ['____', '_____', '__a__', '__Z__', '__8__', '__xyzzy__'] I've interpreted your question as meaning "valid Python identifier that starts and ends with two [implicitly, or more] underscores". In the two alternative patterns, the part in the middle says "zero or more instances of a character that can appear in the middle of a Python identifier". The first pattern spells this out as "capital letters, small letters, digits, and underscore". The second pattern uses the \w shorthand to give the same effect. You should be able to follow that from the Python documentation. Now, read this: http://www.amk.ca/python/howto/regex/ HTH, John Thanks for it all. It does help! André -- http://mail.python.org/mailman/listinfo/python-list
Re: Simple (newbie) regular expression question
André Roberge wrote:
> Sorry for the simple question, but I find regular
> expressions rather intimidating. And I've never
> needed them before ...
>
> How would I go about to 'define' a regular expression that
> would identify strings like
> __alphanumerical__ as in __init__
> (Just to spell things out, as I have seen underscores disappear
> from messages before, that's 2 underscores immediately
> followed by an alphanumerical string immediately followed
> by 2 underscore; in other words, a python 'private' method).
>
> Simple one-liner would be good.
> One-liner with explanation would be better.
>
> One-liner with explanation, and pointer to 'great tutorial'
> (for future reference) would probably be ideal.
> (I know, google is my friend for that last part. :-)
>
> Andre

Firstly, some corrections: (1) google is your friend for _all_ parts of your question (2) Python has an initial P and doesn't have private methods.

Read this:

>>> pat1 = r'__[A-Za-z0-9_]*__'
>>> pat2 = r'__\w*__'
>>> import re
>>> tests = ['x', '__', '____', '_____', '__!__', '__a__', '__Z__', '__8__', '__xyzzy__', '__plugh']
>>> [x for x in tests if re.search(pat1, x)]
['____', '_____', '__a__', '__Z__', '__8__', '__xyzzy__']
>>> [x for x in tests if re.search(pat2, x)]
['____', '_____', '__a__', '__Z__', '__8__', '__xyzzy__']
>>>

I've interpreted your question as meaning "valid Python identifier that starts and ends with two [implicitly, or more] underscores".

In the two alternative patterns, the part in the middle says "zero or more instances of a character that can appear in the middle of a Python identifier". The first pattern spells this out as "capital letters, small letters, digits, and underscore". The second pattern uses the \w shorthand to give the same effect. You should be able to follow that from the Python documentation.

Now, read this: http://www.amk.ca/python/howto/regex/

HTH,
John
-- http://mail.python.org/mailman/listinfo/python-list
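One detail worth adding: re.search() succeeds if the pattern occurs anywhere in the string, so a name such as 'foo__a__bar' would also pass the test above. If the whole string has to be a double-underscore name, re.fullmatch() (Python 3.4 and later) is one way to say so. A short sketch in Python 3 syntax; the test list below is this example's own assumption and the names DUNDER and is_dunder are made up.

```python
import re

DUNDER = re.compile(r'__\w*__')

def is_dunder(name):
    """True if the entire string starts and ends with two underscores."""
    return DUNDER.fullmatch(name) is not None

tests = ['x', '__', '____', '_____', '__!__', '__a__', '__Z__',
         '__8__', '__xyzzy__', '__plugh', 'foo__a__bar']
whole_string_hits = [t for t in tests if is_dunder(t)]
anywhere_hits = [t for t in tests if DUNDER.search(t)]
```

In Python 3, \w also matches non-ASCII letters and digits unless re.ASCII is given, which is slightly broader than the explicit [A-Za-z0-9_] class.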
Simple (newbie) regular expression question
Sorry for the simple question, but I find regular expressions rather intimidating. And I've never needed them before ... How would I go about to 'define' a regular expression that would identify strings like __alphanumerical__ as in __init__ (Just to spell things out, as I have seen underscores disappear from messages before, that's 2 underscores immediately followed by an alphanumerical string immediately followed by 2 underscore; in other words, a python 'private' method). Simple one-liner would be good. One-liner with explanation would be better. One-liner with explanation, and pointer to 'great tutorial' (for future reference) would probably be ideal. (I know, google is my friend for that last part. :-) Andre -- http://mail.python.org/mailman/listinfo/python-list
Re: OT: novice regular expression question
Oops! Sorry, didn't realize that. Thanks, "M.E.Farmer" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > > It's me wrote: > > The shlex.py needs quite a number of .py files. I tried to hunt down > a few > > of them and got really tire. > > > > Is there one batch of .py files that I can download from somewhere? > > > > Thanks, > Not sure what you mean by this. > Shlex is a standard library module. > It imports os and sys only, they are standard library modules. > If you have python you have them already. > If you mean cStringIO it is in the standard library(at least on my > system). > You dont have to use it just feed shlex an open file. > py>lexer = shlex.shlex(open('myrecord.txt', 'r')) > > Hth, > M.E.Farmer > -- http://mail.python.org/mailman/listinfo/python-list
Re: OT: novice regular expression question
It's me wrote: > The shlex.py needs quite a number of .py files. I tried to hunt down a few > of them and got really tire. > > Is there one batch of .py files that I can download from somewhere? > > Thanks, Not sure what you mean by this. Shlex is a standard library module. It imports os and sys only, they are standard library modules. If you have python you have them already. If you mean cStringIO it is in the standard library(at least on my system). You dont have to use it just feed shlex an open file. py>lexer = shlex.shlex(open('myrecord.txt', 'r')) Hth, M.E.Farmer -- http://mail.python.org/mailman/listinfo/python-list
Re: OT: novice regular expression question
The shlex.py needs quite a number of .py files. I tried to hunt down a few of them and got really tire. Is there one batch of .py files that I can download from somewhere? Thanks, "M.E.Farmer" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Hello me, > Have you tried shlex.py it is a tokenizer for writing lexical > parsers. > Should be a breeze to whip something up with it. > an example of tokenizing: > py>import shlex > py># fake an open record > py>import cStringIO > py>myfakeRecord = cStringIO.StringIO() > py>myfakeRecord.write("['1','2'] \n 'fdfdfdfd' \n 'dfdfdfdfd' > ['1','2']\n") > py>myfakeRecord.seek(0) > py>lexer = shlex.shlex(myfakeRecord) > > py>lexer.get_token() > '[' > py>lexer.get_token() > '1' > py>lexer.get_token() > ',' > py>lexer.get_token() > '2' > py>lexer.get_token() > ']' > py>lexer.get_token() > 'fdfdfdfd' > > You can do a lot with it that is just a teaser. > M.E.Farmer > -- http://mail.python.org/mailman/listinfo/python-list
Re: OT: novice regular expression question
Hello me, Have you tried shlex.py it is a tokenizer for writing lexical parsers. Should be a breeze to whip something up with it. an example of tokenizing: py>import shlex py># fake an open record py>import cStringIO py>myfakeRecord = cStringIO.StringIO() py>myfakeRecord.write("['1','2'] \n 'fdfdfdfd' \n 'dfdfdfdfd' ['1','2']\n") py>myfakeRecord.seek(0) py>lexer = shlex.shlex(myfakeRecord) py>lexer.get_token() '[' py>lexer.get_token() '1' py>lexer.get_token() ',' py>lexer.get_token() '2' py>lexer.get_token() ']' py>lexer.get_token() 'fdfdfdfd' You can do a lot with it that is just a teaser. M.E.Farmer -- http://mail.python.org/mailman/listinfo/python-list
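For readers on Python 3, where cStringIO no longer exists, roughly the same experiment can be run with io.StringIO. This is a sketch, not a transcript of a real session. As far as I know, in shlex's default (non-POSIX) mode a quoted item comes back with its quote characters still attached, so the tokens may look a little different from the listing above.

```python
import io
import shlex

# Fake an open record, as in the session above.
fake_record = io.StringIO("['1','2'] \n 'fdfdfdfd' \n 'dfdfdfdfd' ['1','2']\n")

lexer = shlex.shlex(fake_record)   # any file-like object or string works
tokens = list(lexer)               # iterating calls get_token() until EOF

brackets = [t for t in tokens if t in ("[", "]")]
```

Grouping the tokens between '[' and ']' into lists would then be ordinary Python code on top of the token stream.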
Re: OT: novice regular expression question
I'll chew on this. Thanks, got to go.

"Steve Holden" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> It's me wrote:
>
> > I am never very good with regular expressions. My head always hurts
> > whenever I need to use them.
>
> Well, they are a pain to more than just you, and the conventional advice
> is "even when you are convinced you need to use REs, try and find
> another way".
>
> > I need to read a data file and parse each data record. Each item on the
> > data record begins with either a string, or a list of strings. I searched
> > around and didn't see any existing Python packages that do that.
> > scanf.py, for instance, can do standard items but doesn't know about lists.
> > So, I figure I might have to write a lex engine for it and of course I have
> > to deal with REs again.
>
> Well, you haven't yet convinced me that you *have* to. Personally, I
> think you just like trouble :-)
>
> > But I ran into a problem right from the start. To recognize a list, I need
> > an RE for the string:
> >
> > 1) begin with [" (left bracket followed by a double quote with zero or more
> > spaces in between)
> > 2) followed by any characters until ] but only if that right bracket is not
> > preceded by the escape character \.
>
> So the pattern is
>
> 1. If the line begins with a "[" it should end with a "]"
> 2. Otherwise, it shouldn't?
>
> I'm trying to gently point out that the syntax you want to accept isn't
> actually very clear. If the format is "Python strings and lists of
> strings" then you might want to use the Python lexer to parse them, but
> that's quite an advanced topic. [too advanced for me :-]
>
> The problem is matching "up to a right bracket not preceded by a
> backslash". This seems to require what's technically referred to as a
> "negative lookbehind assertion" - in other words, a pattern that doesn't
> consume any characters, but checks that a specific condition is false or
> fails.
>
> > So, I tried:
> >
> > ^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]
> >
> > and tested with:
> >
> > ["This line\] works"]
> >
> > but it fails with:
> >
> > ["This line fails"]
> >
> > I would have thought that:
> >
> > (\\\])*
> >
> > should work because it's zero or more occurrences of the pattern \]
> >
> > Any help is greatly appreciated.
> >
> > Sorry for being OT. I posted this question at the lex group and didn't get
> > any response. I figure maybe somebody around here would know.
>
> I'd start with baby steps. First of all, make sure that you can match
> the individual strings. Then use that pattern, parenthesized to turn it
> into a group, as a component in a more complex pattern.
>
> Do you want to treat "this is also \" a string" as an allowable string?
> In that case you need a pattern that matches "up to the first quotation
> mark not preceded by a backslash" as well!
>
> Let's try matching a single string first:
>
> >>> s = re.compile(r'(".*?(?<!\\)")')
> >>> s.match('"s1", "s2"').groups()
> ('"s1"',)
>
> Note that I followed the "*" with a "?" to stop it being greedy, and
> matching as many characters as it could. OK, does that work when we have
> escaped quotation marks?
>
> >>> s.match(r'"s1\"\"", "s2"').groups()
> ('"s1\\"\\""',)
>
> Apparently so. The negative lookbehind assertion stops a quote from
> matching when it's preceded by a backslash. Can we match a
> comma-separated list of such strings?
>
> >>> slpat = r'(".*?(?<!\\)")(?:, (".*?(?<!\\)"))*'
> >>> s = re.compile(slpat)
>
> This is a bit trickier: here the second grouping beginning with "(?:" is
> intended to ensure that only the strings that get matched are included
> in the groups, not the separators, even though they must be grouped
> together. The list *must* be separated by ", ", but you could alter the
> pattern to allow zero or more whitespace characters.
>
> >>> s.match(r'"s1\"\"", "s2"').groups()
> ('"s1\\"\\""', '"s2"')
>
> Well, that seems to work. Note that these patterns all ignore bracket
> characters, so all you need to do now is to surround them with patterns
> to match the opening and closing brackets, and you're done (I hope).
>
> Anyway, it'll give you a few ideas to work with.
>
> regards
> Steve
> --
> Steve Holden http://www.holdenweb.com/
> Python Web Programming http://pydish.holdenweb.com/
> Holden Web LLC +1 703 861 4237 +1 800 494 3119

-- http://mail.python.org/mailman/listinfo/python-list
Re: OT: novice regular expression question
Check jgsoft dot com; they have two things which may help: EditPad Pro (the test version has a good tutorial) or PowerGREP (if you do a lot of regexes), or the Mastering Regular Expressions book from O'Reilly (if you do a lot of regex work). Also, the Perl group would be good for regexes (Python's are largely Perl 5 compatible). -- http://mail.python.org/mailman/listinfo/python-list
Re: OT: novice regular expression question
It's me wrote:

> I am never very good with regular expressions. My head always hurts
> whenever I need to use them.

Well, they are a pain to more than just you, and the conventional advice
is "even when you are convinced you need to use REs, try and find
another way".

> I need to read a data file and parse each data record. Each item on the
> data record begins with either a string, or a list of strings. I searched
> around and didn't see any existing Python packages that do that.
> scanf.py, for instance, can do standard items but doesn't know about lists.
> So, I figure I might have to write a lex engine for it and of course I have
> to deal with REs again.

Well, you haven't yet convinced me that you *have* to. Personally, I
think you just like trouble :-)

> But I ran into a problem right from the start. To recognize a list, I need
> an RE for the string:
>
> 1) begin with [" (left bracket followed by a double quote with zero or more
> spaces in between)
> 2) followed by any characters until ] but only if that right bracket is not
> preceded by the escape character \.

So the pattern is

1. If the line begins with a "[" it should end with a "]"
2. Otherwise, it shouldn't?

I'm trying to gently point out that the syntax you want to accept isn't
actually very clear. If the format is "Python strings and lists of
strings" then you might want to use the Python lexer to parse them, but
that's quite an advanced topic.

The problem is matching "up to a right bracket not preceded by a
backslash". This seems to require what's technically referred to as a
"negative lookbehind assertion" - in other words, a pattern that doesn't
consume any characters, but checks that a specific condition is false or
fails.

> So, I tried:
>
> ^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]
>
> and tested with:
>
> ["This line\] works"]
>
> but it fails with:
>
> ["This line fails"]
>
> I would have thought that:
>
> (\\\])*
>
> should work because it's zero or more occurrences of the pattern \]
>
> Any help is greatly appreciated.
>
> Sorry for being OT. I posted this question at the lex group and didn't get
> any response. I figure maybe somebody around here would know.

I'd start with baby steps. First of all, make sure that you can match
the individual strings. Then use that pattern, parenthesized to turn it
into a group, as a component in a more complex pattern.

Do you want to treat "this is also \" a string" as an allowable string?
In that case you need a pattern that matches "up to the first quotation
mark not preceded by a backslash" as well!

Let's try matching a single string first:

>>> s = re.compile(r'(".*?(?<!\\)")')
>>> s.match('"s1", "s2"').groups()
('"s1"',)

Note that I followed the "*" with a "?" to stop it being greedy, and
matching as many characters as it could. OK, does that work when we have
escaped quotation marks?

>>> s.match(r'"s1\"\"", "s2"').groups()
('"s1\\"\\""',)

Apparently so. The negative lookbehind assertion stops a quote from
matching when it's preceded by a backslash. Can we match a
comma-separated list of such strings?

>>> slpat = r'(".*?(?<!\\)")(?:, (".*?(?<!\\)"))*'
>>> s = re.compile(slpat)

This is a bit trickier: here the second grouping beginning with "(?:" is
intended to ensure that only the strings that get matched are included
in the groups, not the separators, even though they must be grouped
together. The list *must* be separated by ", ", but you could alter the
pattern to allow zero or more whitespace characters.

>>> s.match(r'"s1\"\"", "s2"').groups()
('"s1\\"\\""', '"s2"')

Well, that seems to work. Note that these patterns all ignore bracket
characters, so all you need to do now is to surround them with patterns
to match the opening and closing brackets, and you're done (I hope).

Anyway, it'll give you a few ideas to work with.

regards
Steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119

-- http://mail.python.org/mailman/listinfo/python-list
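For anyone wanting to try the single-string idea from the reply above, here is a rough, self-contained sketch. The pattern text is an assumed reconstruction from the behaviour described there (a lazy ".*?" followed by a negative lookbehind for a backslash), and the names string_pat, first, escaped and all_strings are made up for illustration. The findall call is not from the reply; it is used here because a repeated capturing group keeps only its last repetition, so a list pattern with one repeated group cannot return more than the first and last string.

```python
import re

# Assumed reconstruction of the single-string pattern: an opening quote,
# a lazy run of characters, then a quote NOT preceded by a backslash.
string_pat = r'(".*?(?<!\\)")'
s = re.compile(string_pat)

# Plain string: the lazy quantifier stops at the first closing quote.
first = s.match('"s1", "s2"').groups()

# Escaped quotes: the lookbehind rejects every quote that follows a backslash.
escaped = s.match(r'"s1\"\"", "s2"').groups()

# One way to collect every string in a longer list (not from the thread).
all_strings = s.findall(r'"a", "b\"c", "d"')

print(first)
print(escaped)
print(all_strings)
```

Note that the lookbehind only checks the single character before the quote, so a string ending in an escaped backslash (\\") would be treated as unterminated by this sketch.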
OT: novice regular expression question
I am never very good with regular expressions. My head always hurts
whenever I need to use them.

I need to read a data file and parse each data record. Each item on the
data record begins with either a string, or a list of strings. I searched
around and didn't see any existing Python packages that do that.
scanf.py, for instance, can do standard items but doesn't know about lists.
So, I figure I might have to write a lex engine for it and of course I have
to deal with REs again.

But I ran into a problem right from the start. To recognize a list, I need
an RE for the string:

1) begin with [" (left bracket followed by a double quote with zero or more
spaces in between)
2) followed by any characters until ] but only if that right bracket is not
preceded by the escape character \.

So, I tried:

^\[[" "]*"[a-z,A-Z\,, ]*(\\\])*[a-z,A-Z\,, \"]*]

and tested with:

["This line\] works"]

but it fails with:

["This line fails"]

I would have thought that:

(\\\])*

should work because it's zero or more occurrences of the pattern \]

Any help is greatly appreciated.

Sorry for being OT. I posted this question at the lex group and didn't get
any response. I figure maybe somebody around here would know.

-- http://mail.python.org/mailman/listinfo/python-list
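The two rules in the post above might be written with a negative lookbehind along the following lines. This is only a hedged sketch for Python's re module: list_pat is a hypothetical name, the pattern does not come from the thread, and it assumes a backslash directly before ] always means an escaped bracket (so an escaped backslash right before the closing ] would be misread).

```python
import re

# Hypothetical pattern for the two rules in the post:
#   1) "[" then zero or more spaces then a double quote
#   2) lazily any characters up to a "]" that is NOT preceded by "\"
list_pat = re.compile(r'^\[ *".*?(?<!\\)\]')

for line in (r'["This line\] works"]', '["This line fails"]'):
    m = list_pat.match(line)
    # Show the matched text, or None when the line is not a list item.
    print(line, '->', m.group(0) if m else None)
```

Because the closing bracket is found by the lookbehind rather than by a character class, the text between the quote and the bracket does not need to be spelled out letter by letter.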