Re: doubling the number of tests, but not taking twice as long
On Wed, Jul 18, 2018 at 7:59 PM, MRAB wrote:
> On 2018-07-18 22:40, Larry Martell wrote:
>> On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti wrote:
>>> On 2018-07-16, Larry Martell wrote:
>>>> I had some code that did this:
>>>>
>>>> meas_regex = '_M\d+_'
>>>> meas_re = re.compile(meas_regex)
>>>>
>>>> if meas_re.search(filename):
>>>>     stuff1()
>>>> else:
>>>>     stuff2()
>>>>
>>>> I then had to change it to this:
>>>>
>>>> if meas_re.search(filename):
>>>>     if 'MeasDisplay' in filename:
>>>>         stuff1a()
>>>>     else:
>>>>         stuff1()
>>>> else:
>>>>     if 'PatternFov' in filename:
>>>>         stuff2a()
>>>>     else:
>>>>         stuff2()
>>>>
>>>> This code needs to process many tens of 1000's of files, and it
>>>> runs often, so it needs to run very fast. Needless to say, my
>>>> change has made it take 2x as long. Can anyone see a way to
>>>> improve that?
>>>
>>> Can you expand/improve the regex pattern so you don't have to rescan
>>> the string to check for the presence of MeasDisplay and
>>> PatternFov? In other words, since you're already using the giant,
>>> Swiss Army sledgehammer of the re module, go ahead and use enough
>>> features to cover your use case.
>>
>> Yeah, that was my first thought, but I haven't been able to come up
>> with a regex that works.
>>
>> There are 4 cases I need to detect:
>>
>> case1 = 'spam_M123_eggs_MeasDisplay_sausage'
>> case2 = 'spam_M123_eggs_sausage_and_spam'
>> case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
>> case4 = 'spam_spam_spam_eggs_sausage_and_spam'
>>
>> I thought this regex would work:
>>
>> '(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'
>>
>> And then I could look at the match objects and see which of the 4
>> cases it was. But try as I might, I could not get it to work. Any
>> regex gurus want to tell me what I am doing wrong here?
>>
> The trick to capturing both of the parts when they are both optional is to
> use a lookahead and make it optional:
>
> r'(?=.*?(_M\d+_))?(?=.*?(MeasDisplay|PatternFov))?'

Wow! Thanks so much. This works perfectly. I don't understand it, but
I will spend some time dissecting it and I will add another tool to my
arsenal.
-- https://mail.python.org/mailman/listinfo/python-list
Re: doubling the number of tests, but not taking twice as long
On 2018-07-18 22:40, Larry Martell wrote:
> On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti wrote:
>> On 2018-07-16, Larry Martell wrote:
>>> I had some code that did this:
>>>
>>> meas_regex = '_M\d+_'
>>> meas_re = re.compile(meas_regex)
>>>
>>> if meas_re.search(filename):
>>>     stuff1()
>>> else:
>>>     stuff2()
>>>
>>> I then had to change it to this:
>>>
>>> if meas_re.search(filename):
>>>     if 'MeasDisplay' in filename:
>>>         stuff1a()
>>>     else:
>>>         stuff1()
>>> else:
>>>     if 'PatternFov' in filename:
>>>         stuff2a()
>>>     else:
>>>         stuff2()
>>>
>>> This code needs to process many tens of 1000's of files, and it
>>> runs often, so it needs to run very fast. Needless to say, my
>>> change has made it take 2x as long. Can anyone see a way to
>>> improve that?
>>
>> Can you expand/improve the regex pattern so you don't have to rescan
>> the string to check for the presence of MeasDisplay and
>> PatternFov? In other words, since you're already using the giant,
>> Swiss Army sledgehammer of the re module, go ahead and use enough
>> features to cover your use case.
>
> Yeah, that was my first thought, but I haven't been able to come up
> with a regex that works.
>
> There are 4 cases I need to detect:
>
> case1 = 'spam_M123_eggs_MeasDisplay_sausage'
> case2 = 'spam_M123_eggs_sausage_and_spam'
> case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
> case4 = 'spam_spam_spam_eggs_sausage_and_spam'
>
> I thought this regex would work:
>
> '(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'
>
> And then I could look at the match objects and see which of the 4
> cases it was. But try as I might, I could not get it to work. Any
> regex gurus want to tell me what I am doing wrong here?

The trick to capturing both of the parts when they are both optional is to
use a lookahead and make it optional:

    r'(?=.*?(_M\d+_))?(?=.*?(MeasDisplay|PatternFov))?'
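For anyone dissecting the optional-lookahead pattern above, here is a minimal sketch of how its two capture groups could drive the four-way dispatch. The classify() helper and the labels it returns are hypothetical stand-ins for the stuff*() calls, whose bodies the thread never shows. Each lookahead scans forward without consuming any text, so both scans start from position 0, and the trailing "?" lets the overall match succeed even when a marker is absent, leaving that group as None.

```python
import re

# The pattern from the message above: two optional lookaheads, each
# capturing one marker if it occurs anywhere in the string.
pattern = re.compile(r'(?=.*?(_M\d+_))?(?=.*?(MeasDisplay|PatternFov))?')

def classify(filename):
    """Return a label naming the branch the original if/else would take.

    Hypothetical helper: the labels stand in for stuff1a(), stuff1(),
    stuff2a() and stuff2().
    """
    # match() always succeeds because both lookaheads are optional.
    m = pattern.match(filename)
    has_meas = m.group(1) is not None   # was '_M<digits>_' found?
    marker = m.group(2)                 # first marker found, or None
    # Caveat: group 2 holds whichever marker appears first, so a name
    # containing both markers may not reproduce the nested 'in' tests.
    if has_meas:
        return 'stuff1a' if marker == 'MeasDisplay' else 'stuff1'
    return 'stuff2a' if marker == 'PatternFov' else 'stuff2'

cases = [
    'spam_M123_eggs_MeasDisplay_sausage',
    'spam_M123_eggs_sausage_and_spam',
    'spam_spam_spam_PatternFov_eggs_sausage_and_spam',
    'spam_spam_spam_eggs_sausage_and_spam',
]
for case in cases:
    print(case, '->', classify(case))
```

Whether this single pass is actually faster than a regex search followed by an "in" test is not established in the thread and would need measuring on real filenames.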
Re: doubling the number of tests, but not taking twice as long
On 18Jul2018 17:40, Larry Martell wrote:
> On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti wrote:
>> On 2018-07-16, Larry Martell wrote:
>>> I had some code that did this:
>>>
>>> meas_regex = '_M\d+_'
>>> meas_re = re.compile(meas_regex)
>>>
>>> if meas_re.search(filename):
>>>     stuff1()
>>> else:
>>>     stuff2()
>>>
>>> I then had to change it to this:
>>>
>>> if meas_re.search(filename):
>>>     if 'MeasDisplay' in filename:
>>>         stuff1a()
>>>     else:
>>>         stuff1()
>>> else:
>>>     if 'PatternFov' in filename:
>>>         stuff2a()
>>>     else:
>>>         stuff2()
>>>
>>> This code needs to process many tens of 1000's of files, and it
>>> runs often, so it needs to run very fast. Needless to say, my
>>> change has made it take 2x as long. Can anyone see a way to
>>> improve that?

As others have mentioned, your stuff*() functions must be doing very
little work, because I'd expect the regexp stuff to be fairly quick.

> Yeah, that was my first thought, but I haven't been able to come up
> with a regex that works.
>
> There are 4 cases I need to detect:
>
> case1 = 'spam_M123_eggs_MeasDisplay_sausage'
> case2 = 'spam_M123_eggs_sausage_and_spam'
> case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
> case4 = 'spam_spam_spam_eggs_sausage_and_spam'
>
> I thought this regex would work:
>
> '(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'

Did you try making that a raw string: r'(..}' to avoid mangling the
backslashes (which Python will interpret before they get to the regexp
parser)? Print meas_regex to check it got past Python intact. Just
print(meas_regex).

Also, "{0,1}" is usually written "?".

> And then I could look at the match objects and see which of the 4
> cases it was. But try as I might, I could not get it to work. Any
> regex gurus want to tell me what I am doing wrong here?

Backslashes aside, it looks ok to me. So I'd better run it...
Code:

    from __future__ import print_function
    import re

    case1 = 'spam_M123_eggs_MeasDisplay_sausage'
    case2 = 'spam_M123_eggs_sausage_and_spam'
    case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
    case4 = 'spam_spam_spam_eggs_sausage_and_spam'

    meas_regex = r'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'
    print("meas_regex =", meas_regex)
    meas_re = re.compile(meas_regex)

    for case in case1, case2, case3, case4:
        print(case, end=" ")
        m = meas_re.search(case)
        if m:
            print("MATCH: group1 =", m.group(1), "group2 =", m.group(2))
        else:
            print("NO MATCH")

Output:

    meas_regex = (_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}
    spam_M123_eggs_MeasDisplay_sausage MATCH: group1 = None group2 = None
    spam_M123_eggs_sausage_and_spam MATCH: group1 = None group2 = None
    spam_spam_spam_PatternFov_eggs_sausage_and_spam MATCH: group1 = None group2 = None
    spam_spam_spam_eggs_sausage_and_spam MATCH: group1 = None group2 = None

Ah, and there's the problem. Though I'm surprised to get the Nones in
the .group()s instead of the empty string; possibly that reflects "0
occurrences". [...] A little testing with other tweaks to the regexp
supports that.

No matter. To your problem:

When you write "(_M\d+_){0,1}" or anything that is optional like that,
it can match the empty string (the "0"). And that _always_ matches.
Likewise the second part of the pattern.

Because you want to know about _both_ the "M\d+_" _and_ the
"MeasDisplay|PatternFOV" you can't put them both in the same pattern:
if you make them optional, the pattern always matches the empty string
even if the target is later on; if you make them mandatory (no
"{0,1}") your pattern will only work when both are present.

Similar pitfalls apply for any combination, making one optional and the
other mandatory: you can't do all 4 possibilities (neither, just the
first, just the second, both) with one regex (== one match/search
test).

So your code was already optimal.
I am surprised that your program took twice as long to run with your
doubled test though. These are filenames, yes? So shouldn't the
stuff*() functions be opening the file or something? I would expect
that to dominate the runtime, and your extra name testing not to be
the slowdown.

What's going on inside the stuff*() functions? Might they also have
become more complex with your new cases?

Cheers,
Cameron Simpson
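The empty match described in this message can be checked directly. The sketch below uses the same shape of pattern, with "?" in place of "{0,1}", and prints where the match lands. It also contrasts a group that never took part in the match, which Python's re reports as None, with a group that took part but matched nothing, which comes back as the empty string. That distinction accounts for the Nones in the output above. The variable names are illustrative only.

```python
import re

# Same shape as the pattern under discussion, with "?" for "{0,1}".
pattern = re.compile(r'(_M\d+_)?.*?(MeasDisplay|PatternFov)?')

m = pattern.search('spam_M123_eggs_MeasDisplay_sausage')
span = m.span()
groups = m.groups()
# Every piece is allowed to match nothing, so the search succeeds
# immediately at position 0 without consuming a single character.
print(span)    # (0, 0)
print(groups)  # (None, None)

# A group that did not participate in the match is None ...
skipped = re.search(r'(x)?', 'abc').group(1)
# ... while a group that participated but matched no text is ''.
empty = re.search(r'(x*)', 'abc').group(1)
print(repr(skipped), repr(empty))  # None ''
```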
Re: doubling the number of tests, but not taking twice as long
On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti wrote:
> On 2018-07-16, Larry Martell wrote:
>> I had some code that did this:
>>
>> meas_regex = '_M\d+_'
>> meas_re = re.compile(meas_regex)
>>
>> if meas_re.search(filename):
>>     stuff1()
>> else:
>>     stuff2()
>>
>> I then had to change it to this:
>>
>> if meas_re.search(filename):
>>     if 'MeasDisplay' in filename:
>>         stuff1a()
>>     else:
>>         stuff1()
>> else:
>>     if 'PatternFov' in filename:
>>         stuff2a()
>>     else:
>>         stuff2()
>>
>> This code needs to process many tens of 1000's of files, and it
>> runs often, so it needs to run very fast. Needless to say, my
>> change has made it take 2x as long. Can anyone see a way to
>> improve that?
>
> Can you expand/improve the regex pattern so you don't have to rescan
> the string to check for the presence of MeasDisplay and
> PatternFov? In other words, since you're already using the giant,
> Swiss Army sledgehammer of the re module, go ahead and use enough
> features to cover your use case.

Yeah, that was my first thought, but I haven't been able to come up
with a regex that works.

There are 4 cases I need to detect:

case1 = 'spam_M123_eggs_MeasDisplay_sausage'
case2 = 'spam_M123_eggs_sausage_and_spam'
case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
case4 = 'spam_spam_spam_eggs_sausage_and_spam'

I thought this regex would work:

'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'

And then I could look at the match objects and see which of the 4
cases it was. But try as I might, I could not get it to work. Any
regex gurus want to tell me what I am doing wrong here?
Re: doubling the number of tests, but not taking twice as long
On 2018-07-16, Larry Martell wrote:
> I had some code that did this:
>
> meas_regex = '_M\d+_'
> meas_re = re.compile(meas_regex)
>
> if meas_re.search(filename):
>     stuff1()
> else:
>     stuff2()
>
> I then had to change it to this:
>
> if meas_re.search(filename):
>     if 'MeasDisplay' in filename:
>         stuff1a()
>     else:
>         stuff1()
> else:
>     if 'PatternFov' in filename:
>         stuff2a()
>     else:
>         stuff2()
>
> This code needs to process many tens of 1000's of files, and it
> runs often, so it needs to run very fast. Needless to say, my
> change has made it take 2x as long. Can anyone see a way to
> improve that?

Can you expand/improve the regex pattern so you don't have to rescan
the string to check for the presence of MeasDisplay and
PatternFov? In other words, since you're already using the giant,
Swiss Army sledgehammer of the re module, go ahead and use enough
features to cover your use case.

--
Neil Cerutti
Re: doubling the number of tests, but not taking twice as long
Larry Martell wrote:
> I had some code that did this:
>
> meas_regex = '_M\d+_'
> meas_re = re.compile(meas_regex)
>
> if meas_re.search(filename):
>     stuff1()
> else:
>     stuff2()
>
> I then had to change it to this:
>
> if meas_re.search(filename):
>     if 'MeasDisplay' in filename:
>         stuff1a()
>     else:
>         stuff1()
> else:
>     if 'PatternFov' in filename:
>         stuff2a()
>     else:
>         stuff2()
>
> This code needs to process many tens of 1000's of files, and it runs
> often, so it needs to run very fast. Needless to say, my change has
> made it take 2x as long.

That is *not* self-evident. Usually stuffX() would take much longer than
the initial tests. So the first step would be to verify that

    if meas_re.search(filename):
        if 'MeasDisplay' in filename:
            pass
        else:
            pass
    else:
        if 'PatternFov' in filename:
            pass
        else:
            pass

takes a significant amount of the total time the piece of code you give
takes to execute.

> Can anyone see a way to improve that?

Not really. I'd check if there is a branch that is executed most of the
time or that takes much longer to execute than the other ones, and then
try to optimize that.
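The check suggested in this message could be sketched with the standard timeit module, as below. The dispatch_only() function, the sample filenames (taken from the four cases posted elsewhere in the thread) and the loop count are assumptions for illustration. The branches return labels instead of executing "pass" so the result can be inspected, which adds a small constant cost of its own.

```python
import re
import timeit

meas_re = re.compile(r'_M\d+_')

def dispatch_only(filename):
    """The name tests alone, with the stuff*() calls stripped out.

    Hypothetical helper: returns a label where the original code
    would call one of the stuff*() functions.
    """
    if meas_re.search(filename):
        if 'MeasDisplay' in filename:
            return 'stuff1a'
        else:
            return 'stuff1'
    else:
        if 'PatternFov' in filename:
            return 'stuff2a'
        else:
            return 'stuff2'

# Sample names only; real measurements should use real filenames.
filenames = [
    'spam_M123_eggs_MeasDisplay_sausage',
    'spam_M123_eggs_sausage_and_spam',
    'spam_spam_spam_PatternFov_eggs_sausage_and_spam',
    'spam_spam_spam_eggs_sausage_and_spam',
]

def run_all():
    for name in filenames:
        dispatch_only(name)

loops = 10000
elapsed = timeit.timeit(run_all, number=loops)
per_file = elapsed / (loops * len(filenames))
print("dispatch cost per filename: %.3g seconds" % per_file)
```

Comparing that per-filename figure with the time one full iteration of the real program takes would show whether the name tests are a significant share of the runtime at all.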
Re: doubling the number of tests, but not taking twice as long
On Mon, Jul 16, 2018 at 6:01 PM, Gilmeh Serda wrote:
> On Mon, 16 Jul 2018 14:17:57 -0400, Larry Martell wrote:
>
>> This code needs to process many tens of 1000's of files, and it runs
>> often, so it needs to run very fast. Needless to say, my change has made
>> it take 2x as long. Can anyone see a way to improve that?
>
> Don't use RegEx search?
>
> My version 361, and a simple benchmarking thing, tells me it's about 2.7
> times slower than "if ... in ..." on 1,000,000 loops.

Without the regex how would you suggest I search for '_M\d+_' efficiently?
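The benchmark behind the 2.7x figure is not shown in the thread. A comparison along those lines might look like the sketch below, where the sample filename and the loop count are assumptions. Note that the two tests answer different questions: a substring test can find a fixed marker such as 'MeasDisplay', but it cannot express the variable digit run in '_M\d+_', so the numbers only indicate relative cost per call.

```python
import re
import timeit

# Sample name taken from the cases posted in the thread.
filename = 'spam_M123_eggs_MeasDisplay_sausage'
meas_re = re.compile(r'_M\d+_')
loops = 100000

# Cost of one compiled-regex search per call.
t_regex = timeit.timeit(lambda: meas_re.search(filename), number=loops)
# Cost of one fixed-substring test per call.
t_in = timeit.timeit(lambda: 'MeasDisplay' in filename, number=loops)

print("regex search: %.4f s for %d loops" % (t_regex, loops))
print("'in' test:    %.4f s for %d loops" % (t_in, loops))
print("ratio regex / in: %.2f" % (t_regex / t_in))
```

The ratio will vary with the Python version, the machine and the filenames used, so it is best read as a rough indication only.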
Re: doubling the number of tests, but not taking twice as long
On 2018-07-16, Larry Martell wrote:
> I had some code that did this:
>
> meas_regex = '_M\d+_'
> meas_re = re.compile(meas_regex)
>
> if meas_re.search(filename):
>     stuff1()
> else:
>     stuff2()
>
> I then had to change it to this:
>
> if meas_re.search(filename):
>     if 'MeasDisplay' in filename:
>         stuff1a()
>     else:
>         stuff1()
> else:
>     if 'PatternFov' in filename:
>         stuff2a()
>     else:
>         stuff2()
>
> This code needs to process many tens of 1000's of files, and it runs
> often, so it needs to run very fast. Needless to say, my change has
> made it take 2x as long.

It's not at all obvious to me. Did you actually measure it?
Seems to depend strongly on what stuff1a and stuff2a are doing.

> Can anyone see a way to improve that?

Use multiprocessing.Pool to exploit multiple CPUs?

Stephan