> My current planned approach is to have each file added read and keywords
> extracted (i.e. variable names, libraries imported etc.) and used as
> keys in a dictionary, and each key points to a list of files containing
> it (files will be stored in a zip or similar).
>
> i.e.
>
> words = {
>     "imaplib": ["X.py"],
>     "smtplib": ["X.py", "Y.py"]
> }
>
> And if I do a search for "smtplib" or "imaplib" it'll return X.py
> first, as it contains both, and Y.py second.
>
> Is there a more efficient approach? I don't think a relational DB really
> suits this sort of task at present, but I'm wondering about search
> times.
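The scheme described above can be written out as a short, runnable sketch. This is a minimal illustration, not a full implementation: keyword extraction is stubbed out as precomputed lists, the file names mirror the example in the quote, and results are ranked simply by how many query terms each file matches.

```python
# Minimal inverted-index sketch: map each extracted keyword to the
# list of files containing it, then rank search hits by match count.
from collections import defaultdict


def build_index(files):
    """files: dict mapping filename -> iterable of extracted keywords."""
    index = defaultdict(list)
    for filename, keywords in files.items():
        for word in set(keywords):  # de-duplicate within a file
            index[word].append(filename)
    return index


def search(index, *terms):
    """Return filenames ranked by how many of the query terms they contain."""
    hits = defaultdict(int)
    for term in terms:
        for filename in index.get(term, []):
            hits[filename] += 1
    # Most matches first.
    return sorted(hits, key=hits.get, reverse=True)


# Hypothetical extracted keywords, matching the example in the quote.
files = {
    "X.py": ["imaplib", "smtplib"],
    "Y.py": ["smtplib"],
}
index = build_index(files)
print(search(index, "smtplib", "imaplib"))  # X.py first: it matches both terms
```

Lookup per term is a single dict access, so search time grows with the number of query terms and the length of their posting lists, not with the total number of indexed files.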
Hi Liam,

That sounds fine, and it's a popular approach. What you have there is
known as an "inverted index":

    http://www.nist.gov/dads/HTML/invertedIndex.html

and it's the way that many search engines link keywords to their
respective documents.

> I've googled for search theory, and hit c2's wiki up, but I'm having no
> luck at finding applicable, understandable theory... a lot of stuff
> about trees and nodes and paths through graphs, but I can't correlate
> that to my problem.

There's a really short-and-sweet book called "Understanding Search
Engines: Mathematical Modeling and Text Retrieval" that covers some of
the core ideas:

    http://www.cs.utk.edu/~berry/usebook/

I'd heartily recommend that one; the authors do talk about the theory,
but from a practitioner's point of view, so it's quite readable. And it
is very thin and easy to carry. *grin*

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor