Re: Seeking regex optimizer
Mirco Wahab wrote:
> Hi, are you the A.Dalke from the Schulten group (VMD) as
> listed here: http://www.ks.uiuc.edu/Overview/People/former.cgi

Yes. But I left there nearly a decade ago.

> # naive regex '\d+9'
> # find some number only if it ends by 9
> my $str="10099000";
>
> print "non-atomic backtracks: $1\n" if $str =~ / ( \d+9 )/x;
> print "atomic: matches $1\n" if $str =~ / ( (?>\d+)9 )/x;

What makes me less than enthusiastic about various new regexp features
is that they can be written other ways. For example,

>>> import re
>>> pat = re.compile("([0-8]*9)+")
>>> pat.match("10099000")
<_sre.SRE_Match object at 0x590a0>
>>> _.group(0)
'10099'
>>> pat = re.compile("([0-8]*9)+(?!\d)")
>>> pat.match("10099000")
>>>

They are different ways to reach the same goal. Understanding the
alternatives requires some non-trivial understanding of how regexp
engines work. The diversity of solutions, with no clear preference of
one over the others, promotes idiosyncratic patterns and requires that
everyone reading the regexps understand all the special syntax
constructions, even when a given person's working vocabulary uses only
a subset of them.

> Second line would not backtrack and not find the 9 behind \d+
> because atomic grouping prevents that, so line 1 matches
> 10099 and line 2 nothing (but very fast, which is the point).

My pattern is not as fast, granted. It's linearly slower than yours
because it does the backtracking (unless the engine is very smart).
That factor is not something I'm concerned with. I can also get
performance without backtracking (though there is setup for
backtracking) using

>>> pat = re.compile("([0-8]*9)+([0-8]*)")
>>> m = pat.match("10099000")
>>> if m.group(2): print "does not end in a 9"
...
does not end in a 9
>>>
>>> m = pat.match("100990009a")
>>> if m.group(2): print "does not end in a 9"
...
>>>

> Why is there a named group count
> restriction in Python? Any reasons
> one should know?

While I'm not certain, I believe allowing more would introduce an
ambiguity with octal escape codes. (*That's* something to omit in
Python 3000!) Here's the quote from perlre:

    There is no limit to the number of captured substrings that you
    may use. However Perl also uses \10, \11, etc. as aliases for
    \010, \011, etc. (Recall that 0 means octal, so \011 is the
    character at number 9 in your coded character set; which would be
    the 10th character, a horizontal tab under ASCII.) Perl resolves
    this ambiguity by interpreting \10 as a backreference only if at
    least 10 left parentheses have opened before it. Likewise \11 is
    a backreference only if at least 11 left parentheses have opened
    before it. And so on. \1 through \9 are always interpreted as
    backreferences.

Python doesn't have "\10", "\11", etc. as aliases for "\010", "\011",
etc.

>>> re.compile("\10")
<_sre.SRE_Pattern object at 0x575c0>
>>> pat = re.compile(r"\10")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py", line 180, in compile
    return _compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py", line 227, in _compile
    raise error, v # invalid expression
sre_constants.error: bogus escape: '\\10'
>>> pat = re.compile(r"\010")
>>>

so there is no ambiguity until reaching \100.

>>> pat = re.compile(r"\100")
>>> pat.match("@")
<_sre.SRE_Match object at 0x232560>
>>>

Most likely /F (who implemented the regex engine) decided to refuse
the temptation to guess the meaning, e.g. by counting parens as Perl
does.
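To package that group(2) trick from above into something reusable (a
small wrapper of my own, not part of the original session; the name
ends_in_9 is invented):

    import re

    # group 1 eats runs of digits ending in a 9; group 2 collects any
    # trailing non-9 digits, so a non-empty group 2 means the digit
    # run does not end in a 9
    _pat = re.compile("([0-8]*9)+([0-8]*)")

    def ends_in_9(s):
        m = _pat.match(s)
        return m is not None and m.group(2) == ""

    print ends_in_9("10099")       # True
    print ends_in_9("10099000")    # False - trailing zeros
    print ends_in_9("100990009a")  # True - the run stops at the last 9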
To see how Perl changes the meaning of "\100" based on the input
string, try this variation of your program:

    for (my $i=98; $i<=103; $i++) {
        my $re = '(\d+)\D' x $i . '\100';
        my $text = join ' ', 0..$i;
        $text .= " @";
        print "Testing $i\n";
        if ($text =~ /$re/) {
            print "  Number of matches: $+\n";
            print "  Last group: ${$i}\n";
        } else {
            print "No matches\n";
        }
    }

I get

    Testing 98
      Number of matches: 98
      Last group: 98
    Testing 99
      Number of matches: 99
      Last group: 99
    Testing 100
    No matches
    Testing 101
    No matches
    Testing 102
    No matches
    Testing 103
    No matches

Andrew
[EMAIL PROTECTED]
Re: Seeking regex optimizer
Thus spoke [EMAIL PROTECTED] (on 2006-06-20 01:39):

Hi, are you the A.Dalke from the Schulten group (VMD) as listed here:
http://www.ks.uiuc.edu/Overview/People/former.cgi

> Replying to me Mirco Wahab wrote:
>> If you pull the strings into (?>( ... )) (atomic groups),
>> this wouldn't happen.
>
> Given that Python's re engine doesn't support this feature
> it doesn't really help the original poster's problem.

Yeah, right. I misunderstood the problem - it was meant literally as a
problem of capture group ids. I took it metaphorically, as "problems
caused by backtracking" (which was wrong).

> Even if some future Python did support it, the limit
> to 100 named groups is unaffected by backtracking.

Right on that one; what I had in mind was something like the
following:

    # naive regex '\d+9'
    # find some number only if it ends by 9
    my $str="10099000";

    print "non-atomic backtracks: $1\n" if $str =~ / ( \d+9 )/x;
    print "atomic: matches $1\n" if $str =~ / ( (?>\d+)9 )/x;

The second line would not backtrack and would not find the 9 behind
\d+ because atomic grouping prevents that, so line 1 matches 10099 and
line 2 nothing (but fails very fast, which is the point).

> >>> import re
> >>> re.compile("(.)"*100)
> Traceback (most recent call last):
> ...
> ...
> AssertionError: sorry, but this version only supports 100 named groups

I tried:

    my $re = '(\d+)\D' x 1001;
    my $text = join ' ', 0..1001;
    print $+ if $text =~ /$re/;
    print "\n", $1001;

which prints:

    1000
    1000

Why is there a named group count restriction in Python? Any reasons
one should know?

Regards & thanks

Mirco
Re: Seeking regex optimizer
Kay Schluehr replied to my question:
>> Why do you want to use a regex for this?
>
> Because it is part of a tokenizer that already uses regexps and I do
> not intend to rewrite / replace it.

Switching to pytst is not a big change - there will be little impact
on the rest of your code. On the other hand, it does change your build
and deployment process because of the need for another package.
Because you are constraining yourself to regexes, there isn't much of
an improvement you can make.

> I found that certain groups of tokens might be represented in a more
> compact manner: for matching ['+', '++'] one might generate
> '\+|\+\+' or '\+\+?' and I wanted to know if there is some generic
> approach to solve the "inverse regexp" problem in a non-trivial
> fashion.

A few questions come up. Why do you want a "more compact"
representation? Do you think it will make the result faster? Likely it
will, because of the reduced need for backtracking, but unless you
have a lot of strings with the same prefix, the number of backtracks
should go as something like the log of the number of words, which
scales pretty well.

Are you speed limited by the existing implementation? Likely not, in
which case the performance isn't an issue. Consider this anecdote from
http://www.technicat.com/writing/programming.html

    I was asked in a phone interview with Google how I would search
    for a number in an array of ordered numbers. Obviously, the
    questioner was asking for a CS 101 answer - binary search. But in
    real life, I would probably do the "wrong" thing - search the
    array from beginning to end. There's no point in taking twice the
    time to write twice as much code that has to be maintained and
    debugged if the application performance is good enough, and in
    particular if that piece of code is not the bottleneck in the
    application. (And I seriously doubt you'd have that data laid out
    linearly in a fixed array like that if it was the bottleneck)

As it turns out, the regexp implementation could do the optimization
you want. That is, it could recognize that the input sequence is a set
of fixed words (no special characters) and implement something like
Aho-Corasick, in which case you wouldn't need to change your input. I
don't know if Python's regexp implementation does this already.
Neither do you. Test the functionality and performance first before
starting other work.

I am not the first to give you this advice.

John Machin:
> Optimised with respect to what? speed? ease of construction?

gry@ll.mit.edu:
> As others have suggested, you should first try the most naive
> implementation before making a hard optimization problem out of this.

As for the general question of the existence of a "generic approach to
solve the 'inverse regexp' problem in a non-trivial fashion": yes,
there is. Kind of.

All (formal/theoretical) regular expressions can be written as a
regular expression without need for backtracking. If you look up the
theory of regular expressions, backtracking occurs when there are
ambiguities in traversing the state machine. These are called
nondeterministic finite state machines, or NFAs. It turns out that
every NFA can be written as a DFA, which has no need for backtracking.

Except. Actually, two excepts. The normal algorithm for converting an
NFA to a DFA doesn't support grouping, which you'll need to figure out
which group matched. Wikipedia says there is an alternate, slower
algorithm which does support groups. The second except is that regexps
in Python, Perl, etc.
are not regular expressions. Some patterns, like r"([A-Z][a-z]+)\1",
which matches "WikiWiki", are not "regular grammars" under the
definition used in language theory. It's possible to use regexps to
match strings of prime length, for example. So in the general case
there is no solution. There are many possible optimizations, but no
general solution.

For your problem, which is a limited subset of the general one as it
involves only constant strings, the solution is a prefix tree, also
called a trie. See http://en.wikipedia.org/wiki/Prefix_tree

John Machin in an earlier post suggested you search along these lines
when he said

> I would have thought the way to approach this would be a simple
> character-by-character tree/trie-driven lookup. This would be worst
> case O(n) where n is the length of the longest string in your list.
> There may well be a Python-callable gadget for this on the net
> somewhere. Google "Danny Yoo ahocorasick" for a Python-callable
> solution to a similar but more complex problem.

For that matter, I suggested it when I pointed you to pytst. The tst
means "ternary search tree" (some write it "ternary search trie"),
which is a different implementation of the same idea.

Your posts read like you ignored those pointers because they didn't
fall into the category "implemented with a regular expression". If
that's the case ...
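The "prime length" aside refers to a folklore pattern; a quick
demonstration (my sketch, not from the original mail):

    import re

    # matches a string of x's whose length is composite: the group
    # grabs a block of two or more characters, and \1+ demands that
    # the rest of the string consist of repeats of that same block
    composite = re.compile(r"^(xx+?)\1+$")

    for n in range(2, 30):
        if not composite.match("x" * n):
            print n,   # prints the primes: 2 3 5 7 11 13 17 19 23 29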
Re: Seeking regex optimizer
[EMAIL PROTECTED] wrote:
> Kay Schluehr wrote:
>> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create
>> a regular expression sx from it, such that sx.match(s) yields a
>> SRE_Match object when s starts with an s_i for one i in [0,...,n].
>
> Why do you want to use a regex for this?

Because it is part of a tokenizer that already uses regexps and I do
not intend to rewrite / replace it. Certain groups of tokens
(operators, braces and special characters) should be user extensible.
All others will stay as they are.

I found that certain groups of tokens might be represented in a more
compact manner: for matching ['+', '++'] one might generate '\+|\+\+'
or '\+\+?', and I wanted to know if there is some generic approach to
solve the "inverse regexp" problem in a non-trivial fashion.

Regards,
Kay
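For what it's worth, the '\+\+?' style of factoring can be generated
mechanically from a character trie. A minimal sketch (my own
illustration, with invented names and no attempt at tuning; the greedy
'?' tries the continuation before the empty match, so match() picks
the longest token):

    import re

    def factored_regex(words):
        # build a character trie; the empty-string key marks the end
        # of a token
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node[""] = {}

        def emit(node):
            alts = []
            optional = False
            for ch in sorted(node):
                if ch == "":
                    optional = True
                else:
                    alts.append(re.escape(ch) + emit(node[ch]))
            if not alts:
                return ""
            if len(alts) == 1:
                body = alts[0]
            else:
                body = "(?:" + "|".join(alts) + ")"
            if optional:
                body = body + "?"
            return body

        return emit(root)

    print factored_regex(["+", "++"])               # \+\+?
    print factored_regex(["a", "aa", "aab", "ab"])  # a(?:ab?|b)?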
Re: Seeking regex optimizer
Replying to me Mirco Wahab wrote:
> If you pull the strings into (?>( ... )) (atomic groups),
> this wouldn't happen.

Given that Python's re engine doesn't support this feature, it doesn't
really help the original poster's problem.

Even if some future Python did support it, the limit of 100 named
groups is unaffected by backtracking.

>>> import re
>>> re.compile("(.)"*100)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py", line 180, in compile
    return _compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py", line 225, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre_compile.py", line 506, in compile
    raise AssertionError(
AssertionError: sorry, but this version only supports 100 named groups
>>>

There was no backtracking in "(.)"*100.

Andrew
[EMAIL PROTECTED]
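A side note on that limit (my addition, assuming the Python 2.4
behaviour shown above): it counts capturing groups only, so
non-capturing groups sidestep it when the captures aren't needed:

>>> import re
>>> re.compile("(.)"*99)      # 99 capturing groups: still fine
<_sre.SRE_Pattern object at ...>
>>> re.compile("(?:.)"*500)   # non-capturing: no such ceiling
<_sre.SRE_Pattern object at ...>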
Re: Seeking regex optimizer
Thus spoke [EMAIL PROTECTED] (on 2006-06-19 22:51):

> It uses Aho-Corasick for the implementation which is fast and does
> what you expect it to do. Nor does it have a problem of matching
> more than 99 possible strings as the regexp approach may have.

If you pull the strings into (?>( ... )) (atomic groups), this
wouldn't happen.

http://www.regular-expressions.info/atomic.html

    ... Everything between (?>) is treated as one single token by the
    regex engine, once the regex engine leaves the group. Because the
    entire group is one token, no backtracking can take place once the
    regex engine has found a match for the group. ...

Maybe Py.2.5 will have them?

Regards

Mirco
Re: Seeking regex optimizer
Kay Schluehr wrote:
> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
> regular expression sx from it, such that sx.match(s) yields a
> SRE_Match object when s starts with an s_i for one i in [0,...,n].

Why do you want to use a regex for this? When you have constant
strings there are other solutions. For example, this uses Nicolas
Lehuen's pytst module from http://nicolas.lehuen.com/index.php/Pytst

>>> import tst
>>> tree = tst.TST()
>>> tree["aa"] = (1, "String with 'aa'")
>>> tree["aahhh"] = (2, "At the doctor's")
>>> tree["a"] = (3, "indefinite article")
>>> tree.scan("This is a bunch of text. Have you seen an aardvark?",
...           tst.TupleListAction())
[('This is ', -8, None), ('a', 1, (3, 'indefinite article')),
 (' bunch of text. H', -18, None), ('a', 1, (3, 'indefinite article')),
 ('ve you seen ', -12, None), ('a', 1, (3, 'indefinite article')),
 ('n ', -2, None), ('aa', 2, (1, "String with 'aa'")),
 ('rdv', -3, None), ('a', 1, (3, 'indefinite article')),
 ('rk?', -3, None)]
>>>

Each returned tuple has three elements.

For a substring which is not a match these are:
  - the unmatched substring
  - the negative length of the substring
  - None

For a substring which matches:
  - the matched substring
  - the positive length of the substring
  - the value of the match in the TST (the tree)

It uses Aho-Corasick for the implementation, which is fast and does
what you expect it to do. Nor does it have a problem matching more
than 99 possible strings, as the regexp approach may have.

Andrew
[EMAIL PROTECTED]
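A small follow-up to the session above (my sketch; it assumes only the
TST/scan/TupleListAction behaviour already shown): picking out the
hits is just a matter of checking the sign of the length field,

    import tst

    tree = tst.TST()
    tree["aa"] = 1
    tree["aahhh"] = 2

    results = tree.scan("aa, then aahhh", tst.TupleListAction())
    # a positive second element marks a matched substring
    hits = [r[0] for r in results if r[1] > 0]
    print hits   # expected: ['aa', 'aahhh']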
Re: Seeking regex optimizer
Kay Schluehr wrote:
> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
> regular expression sx from it, such that sx.match(s) yields a
> SRE_Match object when s starts with an s_i for one i in [0,...,n].
> There might be relations between those strings:
> s_k.startswith(s_1) -> True or s_k.endswith(s_1) -> True. An extreme
> case would be ls = ['a', 'aa', ...,'...ab']. For this reason
> SRE_Match should provide the longest possible match.

In a very similar case I used a simple tree of dictionaries, one node
per letter, to represent the strings. This naturally collapses cases
like ['a','aa','aaa']. Then a recursive function finds the desired
prefix. This was WAY faster than the "re" module for large n (the
tradeoff point for me was n ~ 1000). It requires a bit more coding,
but I think it is the natural data structure for this problem.

As others have suggested, you should first try the most naive
implementation before making a hard optimization problem out of this.

> Is there a Python module able to create an optimized regex rx from
> ls for the given constraints?
>
> Regards,
> Kay
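A minimal sketch of that dictionary tree (my illustration; the names
are invented, and the lookup is written iteratively rather than
recursively):

    def insert(root, word):
        # one nested dict per letter; None marks "a string ends here"
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True

    def longest_prefix(root, s):
        # walk down the tree, remembering the last end-of-string seen
        node, best = root, None
        for i, ch in enumerate(s):
            if None in node:
                best = s[:i]
            if ch not in node:
                return best
            node = node[ch]
        if None in node:
            best = s
        return best

    root = {}
    for w in ['a', 'aa', 'aab', 'ab']:
        insert(root, w)
    print longest_prefix(root, 'aabc')   # 'aab'
    print longest_prefix(root, 'ba')     # None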
Re: Seeking regex optimizer
On 19/06/2006 7:06 PM, Kay Schluehr wrote:
> Mirco,
>
> with "special characters" I meant control characters of regular
> expressions, i.e. one of ".^$()?[]{}\|+*", but not characters beyond
> ASCII-127.
>
> For a workaround you simply have to "mangle" those using an escape
> control character:
>
> REGEXCHAR = ".^$()?[]{}\\|+*"
> def mangle(s):
>     pattern = []
>     for c in s:
>         if c in REGEXCHAR:
>             pattern.append("\\")
>         pattern.append(c)
>     return "".join(pattern)

What's wrong with re.escape()? Have you not read (a) my response to
Paddy's first posting, (b) the manual?
Re: Seeking regex optimizer
Mirco,

with "special characters" I meant control characters of regular
expressions, i.e. one of ".^$()?[]{}\|+*", but not characters beyond
ASCII-127.

For a workaround you simply have to "mangle" those using an escape
control character:

    REGEXCHAR = ".^$()?[]{}\\|+*"

    def mangle(s):
        pattern = []
        for c in s:
            if c in REGEXCHAR:
                pattern.append("\\")
            pattern.append(c)
        return "".join(pattern)

Regards,
Kay
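For example (a quick check of mangle; as John Machin points out
elsewhere in the thread, the standard library's re.escape covers the
same ground for these characters):

>>> mangle("2+2*x")
'2\\+2\\*x'
>>> import re
>>> re.escape("2+2*x")
'2\\+2\\*x'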
Re: Seeking regex optimizer
Kay Schluehr wrote:
> with reverse sorting as in your proposal. The naive solution is easy
> to generate but I'm sceptical about its cost effectiveness. On the
> other hand I do not want to investigate this matter if somebody else
> already did it thoroughly.
>
> Regards,
> Kay

Hi Kay,

The only way to know if something is optimised enough is to test it in
your application. If you haven't finished your application, then don't
sweat it. Take a note of your options, stick one in, then get the
application finished. It's only when it's finished that you can really
say if further optimisations are necessary, and you would have to
profile the complete prog to see where it's spending its time.

I usually find myself using string methods where possible, then
regular expressions only for things that string methods cannot do.
When I finish my script I usually find that Python's speed is adequate
for most of my text processing tasks, or if speed IS an issue, then I
need to profile the completed application anyway.

- Pad.
Re: Seeking regex optimizer
Thus spoke Kay Schluehr (on 2006-06-18 19:07):

> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
> regular expression sx from it, such that sx.match(s) yields a
> SRE_Match object when s starts with an s_i for one i in [0,...,n].
> There might be relations between those strings:
> s_k.startswith(s_1) -> True or s_k.endswith(s_1) -> True. An extreme
> case would be ls = ['a', 'aa', ...,'...ab']. For this reason
> SRE_Match should provide the longest possible match.

With some ideas from Kay and Paddy, I tried to get along with Python
in doing this. If it's allowed to spread the individual strings into
alternations, the following would imho do:

    #!/usr/bin/python
    # -*- coding: iso-8859-15 -*-
    import re

    text = r'this is a text containing aaaöüöaaaµaaa and more'
    lstr = [ 'a', 'aa', 'a', 'aaaöüöaaaµaaa', 'aaab' ]

    pat = re.compile(
        '(' +
        '|'.join(sorted(lstr, lambda a, b: len(b) - len(a))) +
        ')', re.VERBOSE)

    hits = sorted(pat.findall(text), lambda a, b: len(b) - len(a))
    print 'Longest: ', hits[0]

This will print 'aaaöüöaaaµaaa' from the text and won't complain about
the special characters used.

In Perl, you could build up the regex by dynamic evaluation (??{..}),
but I didn't manage to get this working in Python, so here is (in
Perl) how I thought it would work:

    my $text = "this is a text containing aaaöüöaaaµaaa and more";

    my @lstr = ( 'a', 'aa', 'a', 'aaaöüöaaaµaaa', 'aaab', );

    my $re = qr{
        (??{ join '|',
             map  { quotemeta }
             sort { length $b <=> length $a } @lstr })
    }x;

    $_ = $text;
    print "Longest: ", (my $hit) = reverse sort /$re/g;

Maybe the experts can bring some light to it.

Regards

Mirco
Re: Seeking regex optimizer
On 19/06/2006 6:30 AM, Paddy wrote:
> Kay Schluehr wrote:
>> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create
>> a regular expression sx from it, such that sx.match(s) yields a
>> SRE_Match object when s starts with an s_i for one i in [0,...,n].
>> There might be relations between those strings:
>> s_k.startswith(s_1) -> True or s_k.endswith(s_1) -> True.

Kay, what is the relevance of one string being a suffix of another? I
don't see how that could affect the outcome.

>> An extreme case would be ls = ['a', 'aa', ...,'...ab']. For this
>> reason SRE_Match should provide the longest possible match.
>>
>> Is there a Python module able to create an optimized regex rx from
>> ls for the given constraints?

Optimised with respect to what? speed? ease of construction?

I presume that you will ensure that the members of the list are
unique.

Note that the Python regex engine will consider each candidate in
Paddy's solution left to right until it gets a match or reaches the
end (that's why the reverse sort is needed to get the longest possible
match). This is worst-case O(N) where N is the total of the lengths of
all the strings in your list.

As far as I can see, this is the only basic solution (modulo one or
two twiddles -- see below) using Python's re. You could possibly
consider producing "zzz|foo(?:bar)?|aaa" instead of
"zzz|foobar|foo|aaa" -- but whether that would run sufficiently faster
to offset the cost of construction is anybody's guess.

How many strings in your list? Average/maximum length? Likelihood of
ls[i].startswith(ls[j]) == True? unicode or str?

Your requirements are rather constrained: "sx.match(s) yields a
SRE_Match object" ... why do you need this? Surely all you need is
matched_len (which may be zero) such that s[:matched_len] is the
matched prefix.

I would have thought the way to approach this would be a simple
character-by-character tree/trie-driven lookup. This would be worst
case O(n) where n is the length of the longest string in your list.
There may well be a Python-callable gadget for this on the net
somewhere. Google "Danny Yoo ahocorasick" for a Python-callable
solution to a similar but more complex problem.

A cheap hack using Python's re: divide the problem according to the
first character:

    prefix_dict_match = {
        'a': re.compile('alpaca|alligator').match,
        'f': re.compile('foobar|foo').match,
        'z': re.compile('zoo|zebra').match,
    }

    if s and s[0] in prefix_dict_match:
        match_obj = prefix_dict_match[s[0]](s)
    else:
        match_obj = None

>> Regards,
>> Kay
>
> A start would be:
> regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"
> But the above does not work if you have special characters in your
> strings.

Paddy, fixing that problem, and "optimising" by removing the redundant
^() metacharacters:

    regexp = "|".join(map(re.escape, sorted(ls, reverse=True)))

Hoping some of this helps,
Regards,
John
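To see why the reverse sort matters (a small illustration added for
clarity; Python's alternation takes the first branch that succeeds,
not the longest):

    import re

    ls = ['foo', 'foobar']
    plain = re.compile("|".join(map(re.escape, sorted(ls))))
    longest_first = re.compile("|".join(map(re.escape,
                                            sorted(ls, reverse=True))))

    print plain.match("foobar").group()          # 'foo'
    print longest_first.match("foobar").group()  # 'foobar'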
Re: Seeking regex optimizer
Paddy wrote:
> Kay Schluehr wrote:
>> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create
>> a regular expression sx from it, such that sx.match(s) yields a
>> SRE_Match object when s starts with an s_i for one i in [0,...,n].
>> There might be relations between those strings:
>> s_k.startswith(s_1) -> True or s_k.endswith(s_1) -> True. An
>> extreme case would be ls = ['a', 'aa', ...,'...ab']. For this
>> reason SRE_Match should provide the longest possible match.
>>
>> Is there a Python module able to create an optimized regex rx from
>> ls for the given constraints?
>>
>> Regards,
>> Kay
>
> A start would be:
> regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"
> But the above does not work if you have special characters in your
> strings.

For special characters there might be a workaround using escapes. This
is indeed important, but I do think one should split the problem into
separate parts.

> You say you want something that is optimised. What have you tried?

Sorting the list and checking for successor inclusions. Say you have

    ls = ['x', 'a', 'aa', 'aab', 'ab']

This can be mapped to

    'x|a(?:(?:ab)?|b?|a?)'

or to

    '^(x|ab|aab|aa|a)'

with reverse sorting as in your proposal. The naive solution is easy
to generate, but I'm sceptical about its cost effectiveness. On the
other hand I do not want to investigate this matter if somebody else
already did it thoroughly.

Regards,
Kay
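The two candidates are easy to compare empirically (a quick check
added for illustration; remember that Python's engine returns the
first alternative that succeeds, so the factored form is sensitive to
the order of its branches):

    import re

    factored = re.compile('x|a(?:(?:ab)?|b?|a?)')
    naive = re.compile('^(x|ab|aab|aa|a)')

    for s in ['x', 'a', 'aa', 'aab', 'ab']:
        print s, factored.match(s).group(), naive.match(s).group(1)

    # x   x    x
    # a   a    a
    # aa  a    aa   <- (?:ab)? falls back to the empty match here,
    # aab aab  aab
    # ab  a    ab   <- and here, so the factored form stops at 'a'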
Re: Seeking regex optimizer
Kay Schluehr wrote:
> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
> regular expression sx from it, such that sx.match(s) yields a
> SRE_Match object when s starts with an s_i for one i in [0,...,n].
> There might be relations between those strings:
> s_k.startswith(s_1) -> True or s_k.endswith(s_1) -> True. An extreme
> case would be ls = ['a', 'aa', ...,'...ab']. For this reason
> SRE_Match should provide the longest possible match.
>
> Is there a Python module able to create an optimized regex rx from
> ls for the given constraints?
>
> Regards,
> Kay

A start would be:

    regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"

But the above does not work if you have special characters in your
strings.

You say you want something that is optimised. What have you tried?

- Pad.
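For instance (a quick illustration of the one-liner above, assuming no
regex metacharacters in the strings):

>>> import re
>>> ls = ['a', 'aa', 'aab', 'ab']
>>> regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"
>>> regexp
'^(ab|aab|aa|a)'
>>> re.match(regexp, "aabc").group(1)
'aab'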
Seeking regex optimizer
I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
regular expression sx from it, such that sx.match(s) yields a
SRE_Match object when s starts with an s_i for one i in [0,...,n].
There might be relations between those strings:
s_k.startswith(s_1) -> True or s_k.endswith(s_1) -> True. An extreme
case would be ls = ['a', 'aa', ...,'...ab']. For this reason SRE_Match
should provide the longest possible match.

Is there a Python module able to create an optimized regex rx from ls
for the given constraints?

Regards,
Kay