"Andy Dingley" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > > [EMAIL PROTECTED] wrote: > >> I am looking for python code that takes as input a list of strings >> [...] and outputs the python regular expression > > (s1|s2|s3|s4|s5) > for strings of "s1" etc. > > Regex compilers are themselves quite good at optimising beyond this >
It turns out this problem is a little trickier, especially when one of your strings is a leading subset of another. For instance, let's say we are looking for comparison operators, one of <, >, =, >=, <=, or !=. Simply concatenating with intervening '|' characters gives this regexp: "<|>|=|<=|>=|!=" However, the leading '<' and '>' alternatives mask the later '<=', '<>', and '>=' alternatives, and so the regexp never matches the longer options (I was not able to find a greediness switch that would override this). So when searching "a >= b" we get this: >>> re.findall("<|>|=|<=|>=|!=", "a >= b") ['>', '='] By moving the longer option to the front of the regexp, the longer option is no longer masked by the shorter: >>> re.findall(">=|<|>|=|<=|!=", "a >= b") ['>='] You also can't just concatenate input strings, since it is very likely they will contain one of the magic re symbols ()[]?*./\+, etc. So re.escape needs to be called to add the necessary '\'s. Here is an extract from pyparsing's oneOf function that does something similar, that handles the leading substring masking problem, and escapes the input strings, before concatenating them to a valid regexp. Of course, the simplest approach would be to just sort the symbols by descending length, but you may have some a priori knowledge that 'X' is a very common match, and want that option tested as early as possible. So this method only reorders strings if there is a masking problem. def createREfrom( symbols ): #symbols is a list of strings isequal = ( lambda a,b: a == b ) masks = ( lambda a,b: b.startswith(a) ) i = 0 while i < len(symbols)-1: cur = symbols[i] for j,other in enumerate(symbols[i+1:]): if ( isequal(other, cur) ): del symbols[i+j+1] break elif ( masks(cur, other) ): del symbols[i+j+1] symbols.insert(i,other) cur = other break else: i += 1 return "|".join( [ re.escape(sym) for sym in symbols] ) >>> print createREfrom(["ABC","ABCDEF","ABCGHI"]) ABCDEF|ABCGHI|ABC >>> print createREfrom("> < = <= >= != << >> <<< >>>".split()) \>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\= >>> re.findall( createREfrom("> < = <= >= != << >> <<< >>>".split()), "a <= >>> b") ['<='] Note, this does not do any optimization, such as collapsing "ABCDEF|ABCGHI" to "ABC(DEF|GHI)". I think there are some recipes in the Python cookbook for such optimization. -- Paul -- http://mail.python.org/mailman/listinfo/python-list