Re: Regular expression question -- exclude substring

2005-11-07 Thread Bengt Richter
On Mon, 7 Nov 2005 16:38:11 -0800, James Stroud <[EMAIL PROTECTED]> wrote:

>On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
>> Ya, for some reason your non-greedy "?" doesn't seem to be taking.
>> This works:
>>
>> re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
>
>The non-greedy is actually acting as expected. This is because non-greedy 
>operators are "forward looking", not "backward looking". So the non-greedy 
>finds the start of the first start-of-the-match it comes accross and then 
>finds the first occurrence of '01' that makes the complete match, otherwise 
>the greedy operator would match .* as much as it could, gobbling up all '01's 
>before the last because these match '.*'. For example:
>
>py> rgx = re.compile(r"(00.*01) target_mark")
>py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
>['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
>py> rgx = re.compile(r"(00.*?01) target_mark")
>py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
>['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
>
>My understanding is that backward looking operators are very resource 
>expensive to implement.
>
If the delimiting strings are fixed, we can use plain python string methods, 
e.g.,
(not tested beyond what you see ;-)

 >>> s = "00 noise1 01 noise2 00 target 01 target_mark"

 >>> def findit(s, beg='00', end='01', tmk=' target_mark'):
 ... start = 0
 ... while True:
 ... t = s.find(tmk, start)
 ... if t<0: break
 ... start = s.rfind(beg, start, t)
 ... if start<0: break
 ... e = s.find(end, start, t)
 ... if e+len(end)==t: # _just_ after
 ... yield s[start:e+len(end)]
 ... start = t+len(tmk)
 ...
 >>> list(findit(s))
 ['00 target 01']
 >>> s2 = s + ' garbage noise3 00 almost 01  target_mark 00 success 01 
 >>> target_mark'
 >>> list(findit(s2))
 ['00 target 01', '00 success 01']

(I didn't enforce exact adjacency the first time, obviously it would be more 
efficient
to search for end+tmk instead of tmk and back to beg and forward to end ;-)

If there can be spurious target_marks, and tricky matching spans, additional 
logic may be needed.
Too lazy to think about it ;-)

Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression question -- exclude substring

2005-11-07 Thread James Stroud
On Monday 07 November 2005 17:31, Kent Johnson wrote:
> James Stroud wrote:
> > On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
> >>Ya, for some reason your non-greedy "?" doesn't seem to be taking.
> >>This works:
> >>
> >>re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
> >
> > The non-greedy is actually acting as expected. This is because non-greedy
> > operators are "forward looking", not "backward looking". So the
> > non-greedy finds the start of the first start-of-the-match it comes
> > accross and then finds the first occurrence of '01' that makes the
> > complete match, otherwise the greedy operator would match .* as much as
> > it could, gobbling up all '01's before the last because these match '.*'.
> > For example:
> >
> > py> rgx = re.compile(r"(00.*01) target_mark")
> > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat
> > 01') ['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
> > py> rgx = re.compile(r"(00.*?01) target_mark")
> > py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat
> > 01') ['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
>
> ??? not in my Python:
>  >>> rgx = re.compile(r"(00.*01) target_mark")
>  >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat
>  >>> 01')
>
> ['00 noise1 01 noise2 00 target 01']
>
>  >>> rgx = re.compile(r"(00.*?01) target_mark")
>  >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat
>  >>> 01')
>
> ['00 noise1 01 noise2 00 target 01']
>
> Since target_mark only occurs once in the string the greedy and non-greedy
> match is the same in this case.

Somehow my cutting and pasting got messed up. It should be:

py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 
target_mark')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 
target_mark')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']

Sorry about that.

James

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression question -- exclude substring

2005-11-07 Thread Kent Johnson
James Stroud wrote:
> On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
> 
>>Ya, for some reason your non-greedy "?" doesn't seem to be taking.
>>This works:
>>
>>re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
> 
> 
> The non-greedy is actually acting as expected. This is because non-greedy 
> operators are "forward looking", not "backward looking". So the non-greedy 
> finds the start of the first start-of-the-match it comes accross and then 
> finds the first occurrence of '01' that makes the complete match, otherwise 
> the greedy operator would match .* as much as it could, gobbling up all '01's 
> before the last because these match '.*'. For example:
> 
> py> rgx = re.compile(r"(00.*01) target_mark")
> py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
> py> rgx = re.compile(r"(00.*?01) target_mark")
> py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
> ['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

??? not in my Python:
 >>> rgx = re.compile(r"(00.*01) target_mark")
 >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']
 >>> rgx = re.compile(r"(00.*?01) target_mark")
 >>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string the greedy and non-greedy 
match is the same in this case.

Kent
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression question -- exclude substring

2005-11-07 Thread James Stroud
On Monday 07 November 2005 16:18, [EMAIL PROTECTED] wrote:
> Ya, for some reason your non-greedy "?" doesn't seem to be taking.
> This works:
>
> re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)

The non-greedy is actually acting as expected. This is because non-greedy 
operators are "forward looking", not "backward looking". So the non-greedy 
finds the start of the first start-of-the-match it comes accross and then 
finds the first occurrence of '01' that makes the complete match, otherwise 
the greedy operator would match .* as much as it could, gobbling up all '01's 
before the last because these match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

My understanding is that backward looking operators are very resource 
expensive to implement.

James

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression question -- exclude substring

2005-11-07 Thread google
Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regular expression question -- exclude substring

2005-11-07 Thread Kent Johnson
[EMAIL PROTECTED] wrote:
> Hi,
> 
> I'm having trouble extracting substrings using regular expression. Here
> is my problem:
> 
> Want to find the substring that is immediately before a given
> substring. For example: from
> "00 noise1 01 noise2 00 target 01 target_mark",
> want to get
> "00 target 01"
> which is before
> "target_mark".
> My regular expression
> "(00.*?01) target_mark"
> will extract
> "00 noise1 01 noise2 00 target 01".

If there is a character that can't appear in the bit between the numbers then 
use everything-but-that instead of . - for example if spaces can only appear as 
you show them, use
"(00 [^ ]* 01) target_mark" or
"(00 \S* 01) target_mark"

Kent
-- 
http://mail.python.org/mailman/listinfo/python-list