Re: Regular expression question -- exclude substring
On Mon, 7 Nov 2005 16:38:11 -0800, James Stroud <[EMAIL PROTECTED]> wrote:

>On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
>> Ya, for some reason your non-greedy "?" doesn't seem to be taking.
>> This works:
>>
>> re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
>
>The non-greedy is actually acting as expected. This is because non-greedy
>operators are "forward looking", not "backward looking". So the non-greedy
>finds the start of the first start-of-the-match it comes across and then
>finds the first occurrence of '01' that makes the complete match, otherwise
>the greedy operator would match .* as much as it could, gobbling up all '01's
>before the last because these match '.*'. For example:
>
>py> rgx = re.compile(r"(00.*01) target_mark")
>py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
>['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
>py> rgx = re.compile(r"(00.*?01) target_mark")
>py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
>['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
>
>My understanding is that backward looking operators are very resource
>expensive to implement.
>
If the delimiting strings are fixed, we can use plain Python string methods,
e.g., (not tested beyond what you see ;-)

 >>> s = "00 noise1 01 noise2 00 target 01 target_mark"
 >>> def findit(s, beg='00', end='01', tmk=' target_mark'):
 ...     start = 0
 ...     while True:
 ...         t = s.find(tmk, start)
 ...         if t<0: break
 ...         start = s.rfind(beg, start, t)
 ...         if start<0: break
 ...         e = s.find(end, start, t)
 ...         if e+len(end)==t: # _just_ after
 ...             yield s[start:e+len(end)]
 ...         start = t+len(tmk)
 ...
 >>> list(findit(s))
 ['00 target 01']
 >>> s2 = s + ' garbage noise3 00 almost 01  target_mark 00 success 01 target_mark'
 >>> list(findit(s2))
 ['00 target 01', '00 success 01']

(I didn't enforce exact adjacency the first time; obviously it would be more
efficient to search for end+tmk, instead of searching for tmk, then back to
beg, then forward to end ;-)

If there can be spurious target_marks and tricky matching spans, additional
logic may be needed. Too lazy to think about it ;-)

Regards,
Bengt Richter
--
http://mail.python.org/mailman/listinfo/python-list
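The closing remark about searching for end+tmk directly could look roughly like the sketch below. This is only one reading of that suggestion, not code from the post: the name find_before_mark and the extra check that no other end delimiter sits inside the span are assumptions added here so that it accepts the same spans as findit in the examples shown.

```python
def find_before_mark(s, beg='00', end='01', tmk=' target_mark'):
    """Yield each beg...end span that sits immediately before tmk.

    Hypothetical helper: anchors on the adjacent pair end+tmk,
    then searches backward for the nearest beg.
    """
    tail = end + tmk              # the adjacent pair we anchor on
    pos = 0
    while True:
        t = s.find(tail, pos)     # t is the index of the closing delimiter
        if t < 0:
            break
        b = s.rfind(beg, pos, t)  # nearest opening delimiter before it
        # Assumption: reject spans that contain an earlier closing delimiter,
        # mirroring the "_just_ after" test in findit.
        if b >= 0 and s.find(end, b + len(beg), t) < 0:
            yield s[b:t + len(end)]
        pos = t + len(tail)


s = "00 noise1 01 noise2 00 target 01 target_mark"
print(list(find_before_mark(s)))                       # ['00 target 01']
print(list(find_before_mark(s + ' 00 success 01 target_mark')))
```

Because the anchor already includes the closing delimiter, adjacency needs no separate test, and a single failed backward search skips that marker instead of ending the whole scan.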
Re: Regular expression question -- exclude substring
On Monday 07 November 2005 17:31, Kent Johnson wrote:
> James Stroud wrote:
> > On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
> >>Ya, for some reason your non-greedy "?" doesn't seem to be taking.
> >>This works:
> >>
> >>re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
> >
> > The non-greedy is actually acting as expected. This is because non-greedy
> > operators are "forward looking", not "backward looking". So the
> > non-greedy finds the start of the first start-of-the-match it comes
> > across and then finds the first occurrence of '01' that makes the
> > complete match, otherwise the greedy operator would match .* as much as
> > it could, gobbling up all '01's before the last because these match '.*'.
> > For example:
> >
> > py> rgx = re.compile(r"(00.*01) target_mark")
> > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> > ['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
> > py> rgx = re.compile(r"(00.*?01) target_mark")
> > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> > ['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
>
> ??? not in my Python:
> >>> rgx = re.compile(r"(00.*01) target_mark")
> >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01']
> >>> rgx = re.compile(r"(00.*?01) target_mark")
> >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01']
>
> Since target_mark only occurs once in the string the greedy and non-greedy
> match is the same in this case.

Somehow my cutting and pasting got messed up. It should be:

py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']

Sorry about that.

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/
Re: Regular expression question -- exclude substring
James Stroud wrote:
> On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
>
>>Ya, for some reason your non-greedy "?" doesn't seem to be taking.
>>This works:
>>
>>re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
>
>
> The non-greedy is actually acting as expected. This is because non-greedy
> operators are "forward looking", not "backward looking". So the non-greedy
> finds the start of the first start-of-the-match it comes across and then
> finds the first occurrence of '01' that makes the complete match, otherwise
> the greedy operator would match .* as much as it could, gobbling up all '01's
> before the last because these match '.*'. For example:
>
> py> rgx = re.compile(r"(00.*01) target_mark")
> py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
> py> rgx = re.compile(r"(00.*?01) target_mark")
> py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

??? not in my Python:

 >>> rgx = re.compile(r"(00.*01) target_mark")
 >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
 ['00 noise1 01 noise2 00 target 01']
 >>> rgx = re.compile(r"(00.*?01) target_mark")
 >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
 ['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string, the greedy and non-greedy
matches are the same in this case.

Kent
Re: Regular expression question -- exclude substring
On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
> Ya, for some reason your non-greedy "?" doesn't seem to be taking.
> This works:
>
> re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)

The non-greedy is actually acting as expected. This is because non-greedy
operators are "forward looking", not "backward looking". So the non-greedy
finds the start of the first start-of-the-match it comes across and then
finds the first occurrence of '01' that makes the complete match, otherwise
the greedy operator would match .* as much as it could, gobbling up all '01's
before the last because these match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

My understanding is that backward looking operators are very resource
expensive to implement.

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/
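The "forward looking" point can be checked directly: a non-greedy quantifier only shortens the end of a match, while the start is still the leftmost position from which the whole pattern can succeed. The sketch below shows that on the original poster's string, and then tries one possible workaround that nobody in the thread proposed: a negative lookahead that stops the span from running across another '00'. It assumes the wanted span never contains '00' itself; the names lazy and tempered are just illustrative.

```python
import re

s = '00 noise1 01 noise2 00 target 01 target_mark'

# Non-greedy still begins at the leftmost '00' that can lead to a match.
lazy = re.compile(r'(00.*?01) target_mark')
m = lazy.search(s)
print(m.start(1), m.group(1))   # 0 00 noise1 01 noise2 00 target 01

# Assumption: the wanted span must not contain a second '00'.
# (?:(?!00).)*? consumes one character at a time and refuses to
# step onto a position where another '00' begins.
tempered = re.compile(r'(00(?:(?!00).)*?01) target_mark')
print(tempered.findall(s))      # ['00 target 01']
```

With the lookahead in place, the attempt starting at the first '00' fails as soon as it reaches the second one, so the engine moves on and the match starts at the '00' nearest to target_mark.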
Re: Regular expression question -- exclude substring
Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
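A small runnable check of the pattern above, using the original poster's example string as your_string. The leading greedy (.*) swallows as much as it can and only gives back what is needed, which pushes the '00' of the second group as far right as possible. The second sample string (two_marks) is borrowed from elsewhere in the thread to illustrate a limitation: with several target_marks on one line, only the last span is returned.

```python
import re

pattern = re.compile(r'(.*)(00.*?01) target_mark')

s = '00 noise1 01 noise2 00 target 01 target_mark'
print(pattern.sub(r'\2', s))          # 00 target 01
print(pattern.search(s).group(2))     # 00 target 01

# With two marks, the greedy prefix skips the first span entirely.
two_marks = s + ' 00 dowhat 01 target_mark'
print(pattern.search(two_marks).group(2))   # 00 dowhat 01
```

Using search() and group(2) instead of sub() avoids carrying along any text that follows target_mark.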
Re: Regular expression question -- exclude substring
[EMAIL PROTECTED] wrote:
> Hi,
>
> I'm having trouble extracting substrings using regular expressions. Here
> is my problem:
>
> Want to find the substring that is immediately before a given
> substring. For example: from
> "00 noise1 01 noise2 00 target 01 target_mark",
> want to get
> "00 target 01"
> which is before
> "target_mark".
> My regular expression
> "(00.*?01) target_mark"
> will extract
> "00 noise1 01 noise2 00 target 01".

If there is a character that can't appear in the bit between the numbers,
then use everything-but-that instead of '.'. For example, if spaces can only
appear as you show them, use
"(00 [^ ]* 01) target_mark" or
"(00 \S* 01) target_mark"

Kent
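Both suggested patterns can be tried on the example string as follows. The last line uses a made-up input (with_space) to show the stated precondition: if the text between the delimiters may itself contain a space, these patterns find nothing.

```python
import re

s = '00 noise1 01 noise2 00 target 01 target_mark'

# Everything-but-space between the delimiters.
print(re.findall(r'(00 [^ ]* 01) target_mark', s))   # ['00 target 01']
# Same idea with \S, which excludes all whitespace.
print(re.findall(r'(00 \S* 01) target_mark', s))     # ['00 target 01']

# Hypothetical input where the payload contains a space: no match.
with_space = '00 my target 01 target_mark'
print(re.findall(r'(00 [^ ]* 01) target_mark', with_space))   # []
```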