onemind wrote:
Thanks,

The thing is, i am going to need to use different letters each time to
search through over 200,000 words in a database and it needs to be fast.

What technology would be best suited for this task? I just assumed that a
databse would be ideal, why do you say sql isn't suited for this and what
is?

Thanks again.

Derrel Lipman's recent post may answer your question better, but here's a sketch of a solution that involves SQLite.


1) Find a suitable regular expression library, say, PCRE (Perl Compatible Regular Expressions) -- www.prce.org

2) Write a C function to be used from within SQLite, using the instructions found at:

http://www.sqlite.org/capi3ref.html#sqlite3_create_function

The C function might be a custom one that, given a string of letters, searched for all letters (AND) or any letters (OR), possibly using the RegEx library.

3) Recompile SQLite with said regex library added into the SQLite code, as well as your C function.

4) Register your C function with SQLite using the above API

5) Use the function with the regex '[spqd]' to search for words containing the letters "s", "p", "q", OR "d". Doing it for all letters (AND) may be doable with a single regex, but if not, you can always, in your custom function, search for all the letters, mark them off one by one as you find them, and return the appropriate value when all have been found, otherwise, if you get to the end of the string, then return another appropriate value.

Another poster mentioned that you should really test the straightforward, simple-minded approach that he mentioned, first. If it is fast enough, then why bother doing it the hard way. The above probably also won't use an index, so it is also an O(n) approach, like the simple-minded approach of doing several LIKE's probably is.

200,000 words does not sound like a whole lot. The first query might be a little slow, but if your table fits in memory, then your operating system's cache will probably make subsequent queries rather fast.

Having said all this, the fastest way would probably be to use an in-memory datastructure, and simply query that in-memory. One possible -- and very simple -- solution would be to have a hash-map for every character you wished to be able to search, then store pointers to the strings of the words in each hash-map. That would make your lookup-times be O(m), where m is the number of letters to search for, rather than O(n), where n is the number of words.


HTH

Ulrik Petersen

Reply via email to