If you can use a third-party library, I think you can use the Aho-Corasick algorithm.

https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

https://pypi.python.org/pypi/pyahocorasick/
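If installing the library is not an option, the core of the algorithm (a trie of all patterns plus "failure" links, so the text is scanned in a single pass regardless of how many patterns there are) can be sketched in pure Python. This is a minimal illustration of the technique, not pyahocorasick's actual implementation, and the helper names are my own:

```python
# Minimal Aho-Corasick sketch: build a trie over all patterns, add
# failure links with a BFS, then scan the text once, reporting every
# (start_index, pattern) match.
from collections import deque

def build_automaton(patterns):
    goto = [{}]       # goto[state][char] -> next state (trie edges)
    out = [set()]     # patterns that end at each state
    fail = [0]        # failure link of each state
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                fail.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    # BFS from the root to fill in failure links
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            if fail[t] == t:      # depth-1 states fall back to the root
                fail[t] = 0
            out[t] |= out[fail[t]]  # inherit matches from the fallback state
    return goto, fail, out

def find_all(text, goto, fail, out):
    """Yield (start_index, pattern) for every occurrence in text."""
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            yield i - len(pat) + 1, pat
```

The point is that the scan cost is proportional to the length of the text, not to the number of patterns, which is what matters with ~1000 patterns over millions of rows.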

On Sat, Feb 25, 2017 at 3:54 AM,  <kar6...@gmail.com> wrote:
> I have a task to search for multiple patterns in an incoming string and 
> replace them with the matched patterns' replacements. I'm storing all 
> patterns as keys in a dict and the replacements as values, compiling all 
> the patterns into one regex, and using the sub method on the pattern 
> object for replacement. But the problem is that I have tens of millions of 
> rows to check against about 1000 patterns, and this turns out to be a very 
> expensive operation.
>
> What can be done to optimize it? Also, I have special characters in the 
> patterns; where can I specify raw-string combinations?
>
> for example, if the search pattern is not a variable, we can say
>
> re.sub(r"\$%^search_text", "replace_text", "some_text"), but when I read 
> from the dict, where should I place the "r" prefix? Unfortunately, putting 
> it inside the key, "r key" like this, doesn't work....
>
> Pseudo code
>
> pattern = re.compile('|'.join(map(re.escape, regex_map.keys())))
> for string in genobj_of_million_strings:
>     yield pattern.sub(lambda m: regex_map[m.group(0)], string)
> --
> https://mail.python.org/mailman/listinfo/python-list
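Regarding the raw-string question: the "r" prefix is only source-code notation. r"\$" and "\\$" produce the same string object, so strings read from a dict never need it. What you want instead is re.escape, which makes the regex engine treat special characters like $ % ^ literally. Here is a sketch of the regex approach with the patterns escaped and the pattern compiled once, outside the loop (regex_map and genobj_of_million_strings are the names from your pseudocode; the example mapping is made up):

```python
import re

regex_map = {"$%^search_text": "replace_text"}  # example mapping

# Escape each key so $, %, ^ etc. match literally; sort longer keys
# first so that no key shadows another key it is a prefix of.
pattern = re.compile(
    "|".join(map(re.escape, sorted(regex_map, key=len, reverse=True)))
)

def replace_all(strings):
    # Look the matched text up in the dict to get its replacement.
    for s in strings:
        yield pattern.sub(lambda m: regex_map[m.group(0)], s)
```

Note that the lambda must index the dict with m.group(0) (the matched text), not with the match object itself, and that compiling inside the loop would redo the same work millions of times.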
