Re: doubling the number of tests, but not taking twice as long

2018-07-18 Thread Larry Martell
On Wed, Jul 18, 2018 at 7:59 PM, MRAB  wrote:
> On 2018-07-18 22:40, Larry Martell wrote:
>>
>> On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti  wrote:
>>>
>>> On 2018-07-16, Larry Martell  wrote:

>>>> I had some code that did this:
>>>>
>>>> meas_regex = '_M\d+_'
>>>> meas_re = re.compile(meas_regex)
>>>>
>>>> if meas_re.search(filename):
>>>>     stuff1()
>>>> else:
>>>>     stuff2()
>>>>
>>>> I then had to change it to this:
>>>>
>>>> if meas_re.search(filename):
>>>>     if 'MeasDisplay' in filename:
>>>>         stuff1a()
>>>>     else:
>>>>         stuff1()
>>>> else:
>>>>     if 'PatternFov' in filename:
>>>>         stuff2a()
>>>>     else:
>>>>         stuff2()
>>>>
>>>> This code needs to process many tens of 1000's of files, and it
>>>> runs often, so it needs to run very fast. Needless to say, my
>>>> change has made it take 2x as long. Can anyone see a way to
>>>> improve that?
>>>
>>>
>>> Can you expand/improve the regex pattern so you don't have to rescan
>>> the string to check for the presence of MeasDisplay and
>>> PatternFov? In other words, since you're already using the giant,
>>> Swiss Army sledgehammer of the re module, go ahead and use enough
>>> features to cover your use case.
>>
>>
>> Yeah, that was my first thought, but I haven't been able to come up
>> with a regex that works.
>>
>> There are 4 cases I need to detect:
>>
>> case1 = 'spam_M123_eggs_MeasDisplay_sausage'
>> case2 = 'spam_M123_eggs_sausage_and_spam'
>> case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
>> case4 = 'spam_spam_spam_eggs_sausage_and_spam'
>>
>> I thought this regex would work:
>>
>> '(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'
>>
>> And then I could look at the match objects and see which of the 4
>> cases it was. But try as I might, I could not get it to work. Any
>> regex gurus want to tell me what I am doing wrong here?
>>
> The trick to capturing both of the parts when they are both optional is to
> use a lookahead and make it optional:
>
> r'(?=.*?(_M\d+_))?(?=.*?(MeasDisplay|PatternFov))?'

Wow! Thanks so much. This works perfectly. I don't understand it, but
I will spend some time dissecting it and I will add another tool to my
arsenal.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: doubling the number of tests, but not taking twice as long

2018-07-18 Thread MRAB

On 2018-07-18 22:40, Larry Martell wrote:

> On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti wrote:
>>
>> On 2018-07-16, Larry Martell wrote:
>>>
>>> I had some code that did this:
>>>
>>> meas_regex = '_M\d+_'
>>> meas_re = re.compile(meas_regex)
>>>
>>> if meas_re.search(filename):
>>>     stuff1()
>>> else:
>>>     stuff2()
>>>
>>> I then had to change it to this:
>>>
>>> if meas_re.search(filename):
>>>     if 'MeasDisplay' in filename:
>>>         stuff1a()
>>>     else:
>>>         stuff1()
>>> else:
>>>     if 'PatternFov' in filename:
>>>         stuff2a()
>>>     else:
>>>         stuff2()
>>>
>>> This code needs to process many tens of 1000's of files, and it
>>> runs often, so it needs to run very fast. Needless to say, my
>>> change has made it take 2x as long. Can anyone see a way to
>>> improve that?
>>
>> Can you expand/improve the regex pattern so you don't have to rescan
>> the string to check for the presence of MeasDisplay and
>> PatternFov? In other words, since you're already using the giant,
>> Swiss Army sledgehammer of the re module, go ahead and use enough
>> features to cover your use case.
>
> Yeah, that was my first thought, but I haven't been able to come up
> with a regex that works.
>
> There are 4 cases I need to detect:
>
> case1 = 'spam_M123_eggs_MeasDisplay_sausage'
> case2 = 'spam_M123_eggs_sausage_and_spam'
> case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
> case4 = 'spam_spam_spam_eggs_sausage_and_spam'
>
> I thought this regex would work:
>
> '(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'
>
> And then I could look at the match objects and see which of the 4
> cases it was. But try as I might, I could not get it to work. Any
> regex gurus want to tell me what I am doing wrong here?

The trick to capturing both of the parts when they are both optional is 
to use a lookahead and make it optional:


r'(?=.*?(_M\d+_))?(?=.*?(MeasDisplay|PatternFov))?'
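
A minimal sketch of how the pattern above might be used to tell the four cases apart with a single match call. The helper name classify and the returned labels are made up for illustration, and the dispatch mirrors the nested if/else from the original post (MeasDisplay only matters when _M\d+_ is present, PatternFov only when it is not). Each optional lookahead scans ahead from the start of the string without consuming anything, so a group is None exactly when its part is absent.

```python
import re

# MRAB's pattern: two optional lookaheads, each capturing one part if present.
both_re = re.compile(r'(?=.*?(_M\d+_))?(?=.*?(MeasDisplay|PatternFov))?')

def classify(filename):
    """Return which stuff*() branch the original nested tests would take.

    The labels are hypothetical stand-ins for the real function calls.
    """
    m = both_re.match(filename)   # always matches; the groups carry the answer
    has_meas = m.group(1) is not None
    keyword = m.group(2)          # 'MeasDisplay', 'PatternFov' or None
    if has_meas:
        return 'stuff1a' if keyword == 'MeasDisplay' else 'stuff1'
    return 'stuff2a' if keyword == 'PatternFov' else 'stuff2'

cases = [
    'spam_M123_eggs_MeasDisplay_sausage',
    'spam_M123_eggs_sausage_and_spam',
    'spam_spam_spam_PatternFov_eggs_sausage_and_spam',
    'spam_spam_spam_eggs_sausage_and_spam',
]
for name in cases:
    print(name, '->', classify(name))
```

One difference from the nested tests: if both keywords occur in the same name, group 2 only reports whichever comes first, so a name like that would need the explicit 'in' checks. Whether one regex is actually faster than a regex plus a substring test is something to measure rather than assume.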


Re: doubling the number of tests, but not taking twice as long

2018-07-18 Thread Cameron Simpson

On 18Jul2018 17:40, Larry Martell  wrote:

> On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti wrote:
>>
>> On 2018-07-16, Larry Martell wrote:
>>>
>>> I had some code that did this:
>>>
>>> meas_regex = '_M\d+_'
>>> meas_re = re.compile(meas_regex)
>>>
>>> if meas_re.search(filename):
>>>     stuff1()
>>> else:
>>>     stuff2()
>>>
>>> I then had to change it to this:
>>>
>>> if meas_re.search(filename):
>>>     if 'MeasDisplay' in filename:
>>>         stuff1a()
>>>     else:
>>>         stuff1()
>>> else:
>>>     if 'PatternFov' in filename:
>>>         stuff2a()
>>>     else:
>>>         stuff2()
>>>
>>> This code needs to process many tens of 1000's of files, and it
>>> runs often, so it needs to run very fast. Needless to say, my
>>> change has made it take 2x as long. Can anyone see a way to
>>> improve that?


As others have mentioned, your stuff*() functions must be doing very little 
work, because I'd expect the regexp stuff to be fairly quick.



> Yeah, that was my first thought, but I haven't been able to come up
> with a regex that works.
>
> There are 4 cases I need to detect:
>
> case1 = 'spam_M123_eggs_MeasDisplay_sausage'
> case2 = 'spam_M123_eggs_sausage_and_spam'
> case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
> case4 = 'spam_spam_spam_eggs_sausage_and_spam'
>
> I thought this regex would work:
>
> '(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'


Did you try making that a raw string:

 r'(..}'

to avoid mangling the backslashes (which Python will interpret before they get 
to the regexp parser)?


Print meas_regex to check it got past Python intact. Just print(meas_regex).
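
A small illustration of why the raw-string prefix matters. \d happens to survive in a normal string literal because Python does not recognise it as an escape (newer Python versions warn about it), but an escape that Python does recognise, such as \b, is converted before the regex parser ever sees it. The sample text and variable names here are made up.

```python
import re

plain = '\bspam\b'    # Python turns each \b into a backspace character
raw = r'\bspam\b'     # backslashes reach the regex parser as word boundaries

print(repr(plain))    # shows what the regex engine actually receives
print(repr(raw))

text = 'spam_M123_eggs spam'
print(re.search(plain, text))   # looks for backspace + 'spam' + backspace
print(re.search(raw, text))     # looks for 'spam' as a whole word
```

Printing the repr() rather than the string itself makes invisible characters such as the backspace show up.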

Also, "{0,1}" is usually written "?".


> And then I could look at the match objects and see which of the 4
> cases it was. But try as I might, I could not get it to work. Any
> regex gurus want to tell me what I am doing wrong here?


Backslashes aside, it looks ok to me. So I'd better run it... Code:

   from __future__ import print_function
   import re

   case1 = 'spam_M123_eggs_MeasDisplay_sausage'
   case2 = 'spam_M123_eggs_sausage_and_spam'
   case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
   case4 = 'spam_spam_spam_eggs_sausage_and_spam'

   meas_regex = r'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'
   print("meas_regex =", meas_regex)

   meas_re = re.compile(meas_regex)

   for case in case1, case2, case3, case4:
       print(case, end=" ")
       m = meas_re.search(case)
       if m:
           print("MATCH: group1 =", m.group(1), "group2 =", m.group(2))
       else:
           print("NO MATCH")

Output:

   meas_regex = (_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}
   spam_M123_eggs_MeasDisplay_sausage MATCH: group1 = None group2 = None
   spam_M123_eggs_sausage_and_spam MATCH: group1 = None group2 = None
   spam_spam_spam_PatternFov_eggs_sausage_and_spam MATCH: group1 = None group2 = None
   spam_spam_spam_eggs_sausage_and_spam MATCH: group1 = None group2 = None

Ah, and there's the problem. Though I'm surprised to get the Nones in the 
.group()s instead of the empty string; possibly that reflects "0 occurrences".
[...] A little testing with other tweaks to the regexp supports that. No 
matter. To your problem:


When you write "(_M\d+_){0,1}" or anything that is optional like that, it can 
match the empty string (the "0"). And that _always_ matches.


Likewise the second part of the pattern.
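
To see the always-matching empty string directly, it may help to print the span of the match as well as the groups. A short sketch, reusing case1 from above and Larry's pattern with "?" in place of "{0,1}"; the variable names are only for illustration.

```python
import re

case1 = 'spam_M123_eggs_MeasDisplay_sausage'
pattern = re.compile(r'(_M\d+_)?.*?(MeasDisplay|PatternFOV)?')

m = pattern.search(case1)
# search() stops at the first position where the pattern can match, and
# with every piece optional that is position 0 with nothing consumed.
print(m.span(), repr(m.group(0)), m.groups())

# Making the first part mandatory moves the match to where _M123_ really is,
# but the optional second group still comes back None (the lazy .*? has no
# reason to advance), and names without _M<digits>_ no longer match at all.
strict = re.compile(r'(_M\d+_).*?(MeasDisplay|PatternFOV)?')
print(strict.search(case1).span(), strict.search(case1).groups())
print(strict.search('spam_spam_spam_eggs_sausage_and_spam'))
```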

Because you want to know about _both_ the "_M\d+_" _and_ the 
"MeasDisplay|PatternFOV" you can't put them both in the same pattern: if you 
make them optional, the pattern always matches the empty string even if the 
target is later on; if you make them mandatory (no "{0,1}") your pattern will 
only work when both are present.


Similar pitfalls apply for any combination, making one optional and the other 
mandatory: you can't do all 4 possibilities (neither, just the first, just the 
second, both) with one regex (== one match/search test).


So your code was already optimal.

I am surprised that your program took twice as long to run with your doubled 
test though. These are filenames, yes? So shouldn't the stuff*() functions be 
opening the file or something? I would expect that to dominate the runtime and 
your extra name testing not to be the slowdown.


What's going on inside the stuff*() functions? Might they also have become more 
complex with your new cases?


Cheers,
Cameron Simpson 


Re: doubling the number of tests, but not taking twice as long

2018-07-18 Thread Larry Martell
On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti  wrote:
> On 2018-07-16, Larry Martell  wrote:
>> I had some code that did this:
>>
>> meas_regex = '_M\d+_'
>> meas_re = re.compile(meas_regex)
>>
>> if meas_re.search(filename):
>>     stuff1()
>> else:
>>     stuff2()
>>
>> I then had to change it to this:
>>
>> if meas_re.search(filename):
>>     if 'MeasDisplay' in filename:
>>         stuff1a()
>>     else:
>>         stuff1()
>> else:
>>     if 'PatternFov' in filename:
>>         stuff2a()
>>     else:
>>         stuff2()
>>
>> This code needs to process many tens of 1000's of files, and it
>> runs often, so it needs to run very fast. Needless to say, my
>> change has made it take 2x as long. Can anyone see a way to
>> improve that?
>
> Can you expand/improve the regex pattern so you don't have to rescan
> the string to check for the presence of MeasDisplay and
> PatternFov? In other words, since you're already using the giant,
> Swiss Army sledgehammer of the re module, go ahead and use enough
> features to cover your use case.

Yeah, that was my first thought, but I haven't been able to come up
with a regex that works.

There are 4 cases I need to detect:

case1 = 'spam_M123_eggs_MeasDisplay_sausage'
case2 = 'spam_M123_eggs_sausage_and_spam'
case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
case4 = 'spam_spam_spam_eggs_sausage_and_spam'

I thought this regex would work:

'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'

And then I could look at the match objects and see which of the 4
cases it was. But try as I might, I could not get it to work. Any
regex gurus want to tell me what I am doing wrong here?


Re: doubling the number of tests, but not taking twice as long

2018-07-17 Thread Neil Cerutti
On 2018-07-16, Larry Martell  wrote:
> I had some code that did this:
>
> meas_regex = '_M\d+_'
> meas_re = re.compile(meas_regex)
>
> if meas_re.search(filename):
>     stuff1()
> else:
>     stuff2()
>
> I then had to change it to this:
>
> if meas_re.search(filename):
>     if 'MeasDisplay' in filename:
>         stuff1a()
>     else:
>         stuff1()
> else:
>     if 'PatternFov' in filename:
>         stuff2a()
>     else:
>         stuff2()
>
> This code needs to process many tens of 1000's of files, and it
> runs often, so it needs to run very fast. Needless to say, my
> change has made it take 2x as long. Can anyone see a way to
> improve that?

Can you expand/improve the regex pattern so you don't have to rescan
the string to check for the presence of MeasDisplay and
PatternFov? In other words, since you're already using the giant,
Swiss Army sledgehammer of the re module, go ahead and use enough
features to cover your use case.

-- 
Neil Cerutti



Re: doubling the number of tests, but not taking twice as long

2018-07-17 Thread Peter Otten
Larry Martell wrote:

> I had some code that did this:
> 
> meas_regex = '_M\d+_'
> meas_re = re.compile(meas_regex)
> 
> if meas_re.search(filename):
>     stuff1()
> else:
>     stuff2()
>
> I then had to change it to this:
>
> if meas_re.search(filename):
>     if 'MeasDisplay' in filename:
>         stuff1a()
>     else:
>         stuff1()
> else:
>     if 'PatternFov' in filename:
>         stuff2a()
>     else:
>         stuff2()
> 
> This code needs to process many tens of 1000's of files, and it runs
> often, so it needs to run very fast. Needless to say, my change has
> made it take 2x as long. 

That is *not* self-evident. Usually stuffX() would take much longer than the 
initial tests.

So the first step would be to verify that

if meas_re.search(filename):
    if 'MeasDisplay' in filename:
        pass
    else:
        pass
else:
    if 'PatternFov' in filename:
        pass
    else:
        pass

takes a significant amount of the total time the piece of code you give 
takes to execute.
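
One way to make that check, sketched with the standard timeit module. The file name list is invented sample data and the helper names run_tests_only and time_tests are hypothetical; the idea is to time the bare tests over a realistic number of names and compare that figure with the runtime of the whole job.

```python
import re
import timeit

meas_re = re.compile(r'_M\d+_')

# Invented sample names; substitute a real directory listing when measuring.
filenames = [
    'spam_M123_eggs_MeasDisplay_sausage',
    'spam_M123_eggs_sausage_and_spam',
    'spam_spam_spam_PatternFov_eggs_sausage_and_spam',
    'spam_spam_spam_eggs_sausage_and_spam',
] * 10000

def run_tests_only(names):
    """Run just the branching tests, with every stuff*() call replaced by pass."""
    for filename in names:
        if meas_re.search(filename):
            if 'MeasDisplay' in filename:
                pass
            else:
                pass
        else:
            if 'PatternFov' in filename:
                pass
            else:
                pass

def time_tests(names, repeat=3):
    """Return the best wall-clock time in seconds for one pass over names."""
    return min(timeit.repeat(lambda: run_tests_only(names), number=1, repeat=repeat))

elapsed = time_tests(filenames)
print('tests alone: %.4f s for %d names' % (elapsed, len(filenames)))
```

If that figure is a small fraction of the total runtime, the extra 'in' tests are unlikely to be what doubled it.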

> Can anyone see a way to improve that?

Not really. I'd check if there is a branch that is executed most of the time 
or that takes much longer to execute than the other ones, and then try to 
optimize that.



Re: doubling the number of tests, but not taking twice as long

2018-07-16 Thread Larry Martell
On Mon, Jul 16, 2018 at 6:01 PM, Gilmeh Serda wrote:
> On Mon, 16 Jul 2018 14:17:57 -0400, Larry Martell wrote:
>
>> This code needs to process many tens of 1000's of files, and it runs
>> often, so it needs to run very fast. Needless to say, my change has made
>> it take 2x as long. Can anyone see a way to improve that?
>
> Don't use RegEx search?
>
> My version 361, and a simple benchmarking thing, tells me it's about 2.7
> times slower than "if ... in ..." on 1,000,000 loops.

Without the regex how would you suggest I search for '_M\d+_' efficiently?


Re: doubling the number of tests, but not taking twice as long

2018-07-16 Thread Stephan Houben
On 2018-07-16, Larry Martell wrote:
> I had some code that did this:
>
> meas_regex = '_M\d+_'
> meas_re = re.compile(meas_regex)
>
> if meas_re.search(filename):
>     stuff1()
> else:
>     stuff2()
>
> I then had to change it to this:
>
> if meas_re.search(filename):
>     if 'MeasDisplay' in filename:
>         stuff1a()
>     else:
>         stuff1()
> else:
>     if 'PatternFov' in filename:
>         stuff2a()
>     else:
>         stuff2()
>
> This code needs to process many tens of 1000's of files, and it runs
> often, so it needs to run very fast. Needless to say, my change has
> made it take 2x as long. 

It's not at all obvious to me.  Did you actually measure it?
Seems to depend strongly on what stuff1a and stuff2a are doing.

> Can anyone see a way to improve that?

Use multiprocessing.Pool to exploit multiple CPUs?

Stephan