Ulrik Petersen wrote:
Hi,
responding to myself...
Ulrik Petersen wrote:
onemind wrote:
Thanks,
The thing is, i am going to need to use different letters each time to
search through over 200,000 words in a database and it needs to be fast.
What technology would be best suited for this task? I just assumed
that a
databse would be ideal, why do you say sql isn't suited for this and
what
is?
Thanks again.
[Snip]
Having said all this, the fastest way would probably be to use an
in-memory datastructure, and simply query that in-memory. One
possible -- and very simple -- solution would be to have a hash-map
for every character you wished to be able to search, then store
pointers to the strings of the words in each hash-map. That would
make your lookup-times be O(m), where m is the number of letters to
search for, rather than O(n), where n is the number of words.
What I said above is complete and utter BS. Sorry.
You might want to look into "bitsets", or "finger-prints", as
Information Retrieval specialists like to call them.
The basic idea is that you make a bitset out of each word, with one bit
for each of the "features" you want to be on or off for that word. For
your purposes, probably you want each bit in the bitset to represent the
presence or absence of one letter. If you only target the 26 letters of
the English alphabet, a 32 bit integer will suffice.
You can store such a bitset in a column in SQLite, say, "fingerprint".
Compute this as you insert the word.
Then use the "&" operator (bitwise-and) of SQLite's language to filter
out those that you don't want.
Say you are interested in those words which do contain "a" and "b", and
"c". Say that "a" is bit 1, and "b" is bit 2, "c" is bit 3. Then you
OR these together (1|2|4=7), giving the value 7 to the &-operator:
SELECT * FROM words WHERE fingerprint & 7;
HTH
Ulrik Petersen
This is a very fast method you could use for an SQL lookup. We use it
in a text search product and it is effective. You might use a 64 bit
word as a bitmap of the alphanumeric character occurrence in the string
in the database and have that as a column in your table. Make it an
index and then you can find all string occurrences very quickly with
simple SQL, but at the cost of having to have a function to generate the
bitmap on an insertion. That function can be activated by a trigger.
You would use a version of the function to generate your search key from
your chosen search string. The encoding can basically be performed
efficiently by a table lookup and an OR.