Ulrik Petersen wrote:
Hi,

responding to myself...

Ulrik Petersen wrote:

onemind wrote:

Thanks,

The thing is, i am going to need to use different letters each time to
search through over 200,000 words in a database and it needs to be fast.

What technology would be best suited for this task? I just assumed that a databse would be ideal, why do you say sql isn't suited for this and what
is?

Thanks again.


[Snip]

Having said all this, the fastest way would probably be to use an in-memory datastructure, and simply query that in-memory. One possible -- and very simple -- solution would be to have a hash-map for every character you wished to be able to search, then store pointers to the strings of the words in each hash-map. That would make your lookup-times be O(m), where m is the number of letters to search for, rather than O(n), where n is the number of words.


What I said above is complete and utter BS.  Sorry.

You might want to look into "bitsets", or "finger-prints", as Information Retrieval specialists like to call them.

The basic idea is that you make a bitset out of each word, with one bit for each of the "features" you want to be on or off for that word. For your purposes, probably you want each bit in the bitset to represent the presence or absence of one letter. If you only target the 26 letters of the English alphabet, a 32 bit integer will suffice.

You can store such a bitset in a column in SQLite, say, "fingerprint". Compute this as you insert the word.

Then use the "&" operator (bitwise-and) of SQLite's language to filter out those that you don't want.

Say you are interested in those words which do contain "a" and "b", and "c". Say that "a" is bit 1, and "b" is bit 2, "c" is bit 3. Then you OR these together (1|2|4=7), giving the value 7 to the &-operator:

SELECT * FROM words WHERE fingerprint & 7;


HTH


Ulrik Petersen

This is a very fast method you could use for an SQL lookup. We use it in a text search product and it is effective. You might use a 64 bit word as a bitmap of the alphanumeric character occurrence in the string in the database and have that as a column in your table. Make it an index and then you can find all string occurrences very quickly with simple SQL, but at the cost of having to have a function to generate the bitmap on an insertion. That function can be activated by a trigger.

You would use a version of the function to generate your search key from your chosen search string. The encoding can basically be performed efficiently by a table lookup and an OR.

Reply via email to