Daniel Yoo wrote:
Kent Johnson <[EMAIL PROTECTED]> wrote:

:> Given a string, I want to find all occurrences of
:> certain predefined words in that string. Problem is, the list of
:> words that should be detected can be in the order of thousands.
:>
:> With the re module, this can be solved something like this:
:>
:> import re
:> r = re.compile("word1|word2|word3|.......|wordN")
:> r.findall(some_string)


The internal data structure that encodes that set of keywords is
probably humongous.  An alternative approach to this problem is to
tokenize your string into words, and then check to see if each word is
in a defined list of "keywords".  This works if your keywords are
single words:

###
import re

keywords = set([word1, word2, ...])  # your predefined keywords
matchingWords = set(re.findall(r'\w+', some_string)).intersection(keywords)
###
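For example, filled in with stand-in values (the sample string and keyword
list below are just illustrative), that sketch runs like this:

```python
import re

some_string = "the quick brown fox jumps over the lazy dog"
keywords = set(["fox", "dog", "cat"])  # stand-in keyword list

# Tokenize into words, then intersect with the keyword set.
matching_words = set(re.findall(r'\w+', some_string)).intersection(keywords)
# matching_words now contains 'fox' and 'dog'
```

The set lookup is O(1) per token, so this scales with the length of the
string rather than the number of keywords.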

Would this approach work for you?



Otherwise, you may want to look at a specialized data structure for
doing multiple keyword matching; I had an older module that wrapped
around a suffix tree:

    http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

Thankfully, it looks like other folks have written their own
implementations of suffix trees:

    http://cs.haifa.ac.il/~shlomo/suffix_tree/

Another approach is something called the Aho-Corasick algorithm:

    http://portal.acm.org/citation.cfm?doid=360825.360855

though I haven't been able to find a nice Python module for this yet.
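For the curious, here is a minimal pure-Python sketch of the idea: build a
trie of the keywords, add "failure" links that jump to the longest proper
suffix still in the trie, then scan the text once. The node layout and
function names below are just illustrative, not any particular library's API:

```python
from collections import deque

def build_automaton(keywords):
    """Build an Aho-Corasick automaton: a trie plus failure links."""
    # Each node: outgoing transitions, a failure link, and matched words.
    trie = [{"next": {}, "fail": 0, "out": []}]
    for word in keywords:
        node = 0
        for ch in word:
            if ch not in trie[node]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[node]["next"][ch] = len(trie) - 1
            node = trie[node]["next"][ch]
        trie[node]["out"].append(word)
    # Breadth-first pass to compute failure links.
    queue = deque(trie[0]["next"].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node]["next"].items():
            queue.append(child)
            # Follow failure links until a node with a ch-transition is found.
            fail = trie[node]["fail"]
            while fail and ch not in trie[fail]["next"]:
                fail = trie[fail]["fail"]
            trie[child]["fail"] = trie[fail]["next"].get(ch, 0)
            # Inherit matches that end at the failure target.
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]
    return trie

def find_keywords(trie, text):
    """Return (position, word) pairs for every keyword occurrence in text."""
    matches = []
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        for word in trie[node]["out"]:
            matches.append((i - len(word) + 1, word))
    return matches
```

With the classic example `build_automaton(["he", "she", "his", "hers"])`,
scanning `"ushers"` reports "she", "he", and "hers", including the
overlapping matches. Construction is linear in the total keyword length,
and the scan is linear in the text length, independent of how many
thousands of keywords you load.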


Best of wishes to you!

Thanks! The Aho-Corasick algorithm seems to be along the lines of what I was looking for, though I haven't read the article completely yet.

Also:
http://alexandria.tue.nl/extra1/wskrap/publichtml/200407.pdf

provided several alternative algorithms.

André

--
http://mail.python.org/mailman/listinfo/python-list