On Sun, Jun 25, 2006 at 07:54:13AM -0700, onemind wrote:
> The thing is, i am going to need to use different letters each time to
> search through over 200,000 words in a database and it needs to be fast.

The quick and dirty way to do this using a sqlite would be to keep a
separate column from the work column with the letters in the word
sorted.   You would have to keep this column up to date yourself though.
I do not believe there is an out-of-the-box way to populate this column
with sqlite.  To do so would involve a trigger and a user defined
function.

In other words you would have:

    sqlite> CREATE TABLE words (word text, sorted text);

And inserting into the table :

    sqlite> insert into words(word,sorted) values ("quads","adqsu");
    sqlite> insert into words(word,sorted) values ("quidas","adiqsu");
    sqlite> insert into words(word,sorted) values ("queen","eenqu");

Then your searches would be in the vein of:

    sqlite> select word from words where sorted like "%a%d%q%s%";
    quads
    quidas

But, this would result in a full table scan for all searches and as a
result be O(N) and would not be very efficient.  Additionally it would
probably be easier to just write a script that would do the same thing
with a textfile of the words and search through it.

Although... sounds like a fun little project...

    % ./init-db
    Creating words.db
    Inserting words from /usr/share/dict/words...
    Inserted 483523 words into db in 58.900854 seconds.


    % ./search-db aqds
    Searching for aqds -> SELECT word FROM words WHERE sorted LIKE '%a%d%q%s%'
    found 754 results in 1.521026 seconds

I still wouldn't suggest this method, but it is a fun exercise.  Each
search in this manner is a full table scan.

> What technology would be best suited for this task? I just assumed that a
> databse would be ideal, why do you say sql isn't suited for this and what
> is?

Others have given good suggestions, Ulrick's bitset is particularly
nice.  Well, its a lazy sunday, lets see what happens.  using the same
approach as above, but using a 'bitset' column instead of 'sorted'.
Also an index is created on the bitset column.

    % ./init-db-bitset
    Creating words-bitset.db
    Inserting words from /usr/share/dict/words...
    Inserted 483523 words into db in 75.368803 seconds.

    % ./search-db-bitset aqds
    Searching for words aqds -> SELECT word FROM words WHERE (bitset & 327689) 
= 327689
    found 754 results in 0.902595 seconds

Although I believe it should still have to do a table scan here. 
But bitwise and is faster than a string comparison in any case.  

Since this also does a full table scan, you'll probably want something
more along the lines of an inverted index of the words by letter in some
sort of dedicated data structure.  

I started playing with this using sqlite to try and do an inverted index
by letter here, but it didn't get close to the performance of what the
bitset was doing.

Ahh, what a fun way to spend part a Sunday :-)

enjoy,

-jeremy

-- 
========================================================================
 Jeremy Hinegardner                              [EMAIL PROTECTED] 

Reply via email to