Kent Johnson wrote:
André Søreng wrote:


Hi!

Given a string, I want to find all occurrences of
certain predefined words in that string. The problem is that the list of
words to detect can be on the order of thousands.

With the re module, this can be solved with something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10 000 words in
the regexp, I get a regular expression runtime error when
calling findall (compile works fine, but slowly).

I don't know whether the re module is the right solution here. Any
suggestions for alternative solutions or data structures that could
be used to solve the problem?


If you can split some_string into individual words, you could look them up in a set of known words:

known_words = set("word1 word2 word3 ....... wordN".split())
found_words = [word for word in some_string.split() if word in known_words]


Kent


André


That is not exactly what I want. It should detect whether any of the predefined words appear as substrings, not only as whole words. For instance, when matching against "word2sgjoisejfisaword1yguyg", both word2 and word1 should be detected.

-- 
http://mail.python.org/mailman/listinfo/python-list
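
One data structure that fits the substring requirement is an Aho-Corasick automaton: a trie of the known words plus failure links, so the text is scanned once no matter how many words there are. Nobody in this thread proposed it; the following is only a minimal sketch under that assumption, and the names build_automaton and find_all are made up for illustration. It assumes the word list contains no empty strings.

```python
from collections import deque


def build_automaton(words):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> words that end when this state is reached

    # Phase 1: insert every word into a trie.
    for word in words:
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(word)

    # Phase 2: breadth-first pass to fill in failure links.
    # States directly below the root keep fail == 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Words ending at the fallback state also end here.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Yield (start_index, word) for every occurrence, overlaps included."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            yield i - len(word) + 1, word


known_words = ["word1", "word2"]
automaton = build_automaton(known_words)
matches = list(find_all("word2sgjoisejfisaword1yguyg", automaton))
found_words = [word for _, word in matches]  # ['word2', 'word1']
```

The automaton is built once and can be reused for many strings. Each scan then takes time proportional to the length of the text plus the number of matches reported, rather than growing with the size of the word list.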
