If you can use a third-party library, I think you can use the Aho-Corasick algorithm.
https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
https://pypi.python.org/pypi/pyahocorasick/

On Sat, Feb 25, 2017 at 3:54 AM, <kar6...@gmail.com> wrote:
> I have a task to search for multiple patterns in an incoming string and
> replace each match. I'm storing all the patterns as keys in a dict and the
> replacements as values; I compile all the patterns into one regex and use
> the sub method on the pattern object for replacement. But I have tens of
> millions of rows to check against about 1000 patterns, and this turns out
> to be a very expensive operation.
>
> What can be done to optimize it? Also, I have special characters to match;
> where can I specify the raw-string combinations?
>
> For example, if the search pattern is not a variable we can write
>
> re.sub(r"\$%^search_text", "replace_text", "some_text"), but when I read
> the pattern from the dict, where should I put the "r" prefix? Unfortunately,
> putting it inside the key ("r key") doesn't work.
>
> Pseudo code:
>
> for string in genobj_of_million_strings:
>     pattern = re.compile('|'.join(regex_map.keys()))
>     return pattern.sub(lambda x: regex_map[x], string)
> --
> https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
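Even without a third-party library, the quoted pseudocode can be made much faster: compile the alternation once outside the loop (not once per row), and look up the replacement via the matched text, m.group(0), since the lambda receives a match object, not a string. Using re.escape() on the keys also answers the raw-string question, because escaping makes the "r" prefix unnecessary. A sketch, with a hypothetical regex_map and sample rows:

```python
# Stdlib-only fix for the quoted pseudocode: one compiled pattern,
# escaped literal keys, and a replacement lookup via the match object.
import re

# Hypothetical stand-in for the poster's regex_map.
regex_map = {"$%^search_text": "replace_text", "foo": "bar"}

# Compile once; longest keys first so a longer pattern beats its own prefix.
pattern = re.compile(
    "|".join(re.escape(key) for key in sorted(regex_map, key=len, reverse=True))
)

def replace_all(row):
    # m.group(0) is the matched text, which is exactly a key of regex_map.
    return pattern.sub(lambda m: regex_map[m.group(0)], row)

for row in ["some $%^search_text here", "foo and foo"]:
    print(replace_all(row))
```

Hoisting re.compile out of the loop alone removes a large constant cost per row; whether this or an Aho-Corasick automaton wins for ~1000 patterns is worth benchmarking on your real data.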