Re: [sqlite] Need sql query help

2006-06-25 Thread C.Peachment
On Sun, 25 Jun 2006 18:24:50 -0700 (PDT), onemind wrote:

>It took over 8 hours, so if anyone could tell me a text command that would
>do this same task of importing a txt file into a table through the sqlite3
>command line that would be great. It must be the gui slowing it down
>somehow.

Repeating details from my earlier post:
--
create table wordlist (word text);

select current_time;

begin transaction;

insert into wordlist values ("test0");
insert into wordlist values ("test1");
...
insert into wordlist values ("test18");
insert into wordlist values ("test19");
commit transaction;

select current_time;

select count(*) from wordlist;
--

It took 8 SECONDS, not 8 HOURS.

The above lines were taken directly from the
text file that I had created and then imported
into the sqlite3 command line utility by typing
the command:

.read word.lst

The only change for this email was to delete
the 199,996 intermediate text lines that would
just get in the way of this example.


The suggestion made to use a bitset is an
excellent one. Put an index on the bitset field.
The sql AND operator will find all combinations
very quickly.

Here is sample code to create the content of
the bitset field from the characters in each word.
If you wrap this into a small program that reads
the word list and creates the above insert
statements then you can also insert the bitset
value as a second field in each insert statement.

unsigned long int
GetBitSetOf (char * InWord)


  int J, K;
  unsigned long int BitSet; // must be at least 26 bits wide

  K = strlen(InWord);
  BitSet = 0;
  for (J = 0; J < K; J++) {
BitSet |= 1 << (InWord[J] - 65);// assumes all letters are uppercase 
only
  }
  return (BitSet);
} // GetBitSetOf






Re: [sqlite] Need sql query help

2006-06-25 Thread onemind

Thanks for all of the great ideas :)

Plently of techniques to work on there.

Just incase your interested, i woke up this morning and all the words
finally made it into sqlite :)

It took over 8 hours, so if anyone could tell me a text command that would
do this same task of importing a txt file into a table through the sqlite3
command line that would be great. It must be the gui slowing it down
somehow.

Anyway, i ran the like query using heaps of different combinations of
letters and the results are returned instantly :)

More than fast enough for my purposes but i will definately check out some
of the other methods just for fun :)

It seems we under estimated sqlites speed for this task.

Anyway, thanks again for all your replies, now i have to figure out how to
sort through the resulted word list and score each letter with certain
points to find the highest scoring words. Incase you haven't guessed, it is
for a scrabble AI project. Am trying to make an AI player in scrabble that
finds the highest scoring word with given tiles and pattern recognition of
the words that are on the board.

In other words, i am trying to clone this site:
http://www.scrabblewordfinder.com/

Cheers :)
--
View this message in context: 
http://www.nabble.com/Need-sql-query-help-t1844399.html#a5040314
Sent from the SQLite forum at Nabble.com.



Re: [sqlite] Need sql query help

2006-06-25 Thread Jeremy Hinegardner
On Sun, Jun 25, 2006 at 07:54:13AM -0700, onemind wrote:
> The thing is, i am going to need to use different letters each time to
> search through over 200,000 words in a database and it needs to be fast.

The quick and dirty way to do this using a sqlite would be to keep a
separate column from the work column with the letters in the word
sorted.   You would have to keep this column up to date yourself though.
I do not believe there is an out-of-the-box way to populate this column
with sqlite.  To do so would involve a trigger and a user defined
function.

In other words you would have:

sqlite> CREATE TABLE words (word text, sorted text);

And inserting into the table :

sqlite> insert into words(word,sorted) values ("quads","adqsu");
sqlite> insert into words(word,sorted) values ("quidas","adiqsu");
sqlite> insert into words(word,sorted) values ("queen","eenqu");

Then your searches would be in the vein of:

sqlite> select word from words where sorted like "%a%d%q%s%";
quads
quidas

But, this would result in a full table scan for all searches and as a
result be O(N) and would not be very efficient.  Additionally it would
probably be easier to just write a script that would do the same thing
with a textfile of the words and search through it.

Although... sounds like a fun little project...

% ./init-db
Creating words.db
Inserting words from /usr/share/dict/words...
Inserted 483523 words into db in 58.900854 seconds.


% ./search-db aqds
Searching for aqds -> SELECT word FROM words WHERE sorted LIKE '%a%d%q%s%'
found 754 results in 1.521026 seconds

I still wouldn't suggest this method, but it is a fun exercise.  Each
search in this manner is a full table scan.

> What technology would be best suited for this task? I just assumed that a
> databse would be ideal, why do you say sql isn't suited for this and what
> is?

Others have given good suggestions, Ulrick's bitset is particularly
nice.  Well, its a lazy sunday, lets see what happens.  using the same
approach as above, but using a 'bitset' column instead of 'sorted'.
Also an index is created on the bitset column.

% ./init-db-bitset
Creating words-bitset.db
Inserting words from /usr/share/dict/words...
Inserted 483523 words into db in 75.368803 seconds.

% ./search-db-bitset aqds
Searching for words aqds -> SELECT word FROM words WHERE (bitset & 327689) 
= 327689
found 754 results in 0.902595 seconds

Although I believe it should still have to do a table scan here. 
But bitwise and is faster than a string comparison in any case.  

Since this also does a full table scan, you'll probably want something
more along the lines of an inverted index of the words by letter in some
sort of dedicated data structure.  

I started playing with this using sqlite to try and do an inverted index
by letter here, but it didn't get close to the performance of what the
bitset was doing.

Ahh, what a fun way to spend part a Sunday :-)

enjoy,

-jeremy

-- 

 Jeremy Hinegardner  [EMAIL PROTECTED] 



Re: [sqlite] Need sql query help

2006-06-25 Thread John Stanton

A. Pagaltzis wrote:

* onemind <[EMAIL PROTECTED]> [2006-06-25 17:00]:


The thing is, i am going to need to use different letters each
time to search through over 200,000 words in a database and it
needs to be fast.



200,000 words is nothing. If they’re 5 letters on average, that’s
some 1.1MB of data. You can grep that in milliseconds.



What technology would be best suited for this task?



Put the lot into a flat textfile, read it into memory, and do a
string scan.



I just assumed that a databse would be ideal, why do you say
sql isn't suited for this and what is?



Because you’re not indexing any of the facts you query. You’re
just doing a scan across all of the table, doing string matches
on one column in each row. There’s no point in using a database
for that.

Regards,
When we have a problem like this we would mmap a flat file and use a 
fast string search algorithm like Boyer-Moore.  It is about as fast as 
it gets if you are looking for something ad hoc and cannot use an index.


Re: [sqlite] Need sql query help

2006-06-25 Thread John Stanton

Ulrik Petersen wrote:

Hi,

responding to myself...

Ulrik Petersen wrote:


onemind wrote:


Thanks,

The thing is, i am going to need to use different letters each time to
search through over 200,000 words in a database and it needs to be fast.

What technology would be best suited for this task? I just assumed 
that a
databse would be ideal, why do you say sql isn't suited for this and 
what

is?

Thanks again.



[Snip]

Having said all this, the fastest way would probably be to use an 
in-memory datastructure, and simply query that in-memory.  One 
possible -- and very simple -- solution would be to have a hash-map 
for every character you wished to be able to search, then store 
pointers to the strings of the words in each hash-map.  That would 
make your lookup-times be O(m), where m is the number of letters to 
search for, rather than O(n), where n is the number of words.



What I said above is complete and utter BS.  Sorry.

You might want to look into "bitsets", or "finger-prints", as 
Information Retrieval specialists like to call them.


The basic idea is that you make a bitset out of each word, with one bit 
for each of the "features" you want to be on or off for that word.  For 
your purposes, probably you want each bit in the bitset to represent the 
presence or absence of one letter.  If you only target the 26 letters of 
the English alphabet,  a 32 bit integer will suffice.


You can store such a bitset in a column in SQLite, say, "fingerprint".  
Compute this as you insert the word.


Then use  the "&" operator (bitwise-and) of SQLite's  language to filter 
out those that you don't want.


Say you are interested in those words which do contain "a" and "b", and 
"c".  Say that "a" is bit 1, and "b" is bit 2, "c" is bit 3.  Then you 
OR these together (1|2|4=7), giving the value 7 to the &-operator:


SELECT * FROM words WHERE fingerprint & 7;


HTH


Ulrik Petersen

This is a very fast method you could use for an SQL lookup.  We use it 
in a text search product and it is effective.  You might use a 64 bit 
word as a bitmap of the alphanumeric character occurrence in the string 
in the database and have that as a column in your table.  Make it an 
index and then you can find all string occurrences very quickly with 
simple SQL, but at the cost of having to have a function to generate the 
bitmap on an insertion.  That function can be activated by a trigger.


You would use a version of the function to generate your search key from 
your chosen search string.  The encoding can basically be performed 
efficiently by a table lookup and an OR.


Re: [sqlite] Need sql query help

2006-06-25 Thread A. Pagaltzis
* onemind <[EMAIL PROTECTED]> [2006-06-25 17:00]:
> The thing is, i am going to need to use different letters each
> time to search through over 200,000 words in a database and it
> needs to be fast.

200,000 words is nothing. If they’re 5 letters on average, that’s
some 1.1MB of data. You can grep that in milliseconds.

> What technology would be best suited for this task?

Put the lot into a flat textfile, read it into memory, and do a
string scan.

> I just assumed that a databse would be ideal, why do you say
> sql isn't suited for this and what is?

Because you’re not indexing any of the facts you query. You’re
just doing a scan across all of the table, doing string matches
on one column in each row. There’s no point in using a database
for that.

Regards,
-- 
Aristotle Pagaltzis // 


Re: [sqlite] Need sql query help

2006-06-25 Thread A. Pagaltzis
* Ulrik Petersen <[EMAIL PROTECTED]> [2006-06-25 17:55]:
> 5) Use the function with the regex '[spqd]' to search for words
> containing the letters "s", "p", "q", OR "d".  Doing it for all
> letters (AND) may be doable with a single regex,

It is doable with an NFA engine like PCRE, but it’s complicated
to express and will incur so much backtracking that it’ll run
much slower than doing four separate matches.

With DFA engine such as egrep’s you can’t express it in a single
pattern at all.

For ultimate performance on strings, you’ll need to walk the
string using a loop in a machine-oriented language like C and
check characters directly.

If you need to go even faster, then you’ll need an inverted
index on letters for the whole dataset.

Regards,
-- 
Aristotle Pagaltzis // 


Re: [sqlite] Need sql query help

2006-06-25 Thread Ulrik Petersen

Hi,

responding to myself...

Ulrik Petersen wrote:

onemind wrote:

Thanks,

The thing is, i am going to need to use different letters each time to
search through over 200,000 words in a database and it needs to be fast.

What technology would be best suited for this task? I just assumed 
that a
databse would be ideal, why do you say sql isn't suited for this and 
what

is?

Thanks again.


[Snip]

Having said all this, the fastest way would probably be to use an 
in-memory datastructure, and simply query that in-memory.  One 
possible -- and very simple -- solution would be to have a hash-map 
for every character you wished to be able to search, then store 
pointers to the strings of the words in each hash-map.  That would 
make your lookup-times be O(m), where m is the number of letters to 
search for, rather than O(n), where n is the number of words.


What I said above is complete and utter BS.  Sorry.

You might want to look into "bitsets", or "finger-prints", as 
Information Retrieval specialists like to call them.


The basic idea is that you make a bitset out of each word, with one bit 
for each of the "features" you want to be on or off for that word.  For 
your purposes, probably you want each bit in the bitset to represent the 
presence or absence of one letter.  If you only target the 26 letters of 
the English alphabet,  a 32 bit integer will suffice.


You can store such a bitset in a column in SQLite, say, "fingerprint".  
Compute this as you insert the word.


Then use  the "&" operator (bitwise-and) of SQLite's  language to filter 
out those that you don't want.


Say you are interested in those words which do contain "a" and "b", and 
"c".  Say that "a" is bit 1, and "b" is bit 2, "c" is bit 3.  Then you 
OR these together (1|2|4=7), giving the value 7 to the &-operator:


SELECT * FROM words WHERE fingerprint & 7;


HTH


Ulrik Petersen



Re: [sqlite] Need sql query help

2006-06-25 Thread John Stanton

onemind wrote:

Thanks,

The thing is, i am going to need to use different letters each time to
search through over 200,000 words in a database and it needs to be fast.

What technology would be best suited for this task? I just assumed that a
databse would be ideal, why do you say sql isn't suited for this and what
is?

Thanks again.
--
View this message in context: 
http://www.nabble.com/Need-sql-query-help-t1844399.html#a5034782
Sent from the SQLite forum at Nabble.com.

A regular expression search on a flat file with only 200,000 words would 
be fast and most likely achieve your objective.  Run some trials with grep.


Re: [sqlite] Need sql query help

2006-06-25 Thread John Stanton

A. Pagaltzis wrote:

* onemind <[EMAIL PROTECTED]> [2006-06-25 16:05]:


If i had a table wit a word column that had a huge list of
words and i wanted to select every word that contained all
these letters "qdsa". 



SELECT *
FROM words
WHERE
word LIKE '%q%'
AND word LIKE '%d%'
AND word LIKE '%s%'
AND word LIKE '%a%'

And that’s going to be slow like molasses. It’s not something SQL
is well suited to.

If you need to do this a lot, I suggest precomputing the kinds of
facts about each word that you’ll want to query and storing them
in a column or dependent table so you can create indices and
query them quickly.

Of course if the performance of the simpleminded approach is
sufficient for you, then all the better.

Regards,
Writing a function which does the string match would be a more efficient 
approach.


Re: [sqlite] Need sql query help

2006-06-25 Thread John Stanton

onemind wrote:
Hi, 
I was hoping someone could tell me if it was possible to select all words
containing ceratin letters. 

Eg 


If i had a table wit a word column that had a huge list of words and i
wanted to select every word that contained all these letters "qdsa". 

Then it would return the words: 

quads 
quidas 

ect 

but wouldn't return 

queen 

because queen does not contain all letters specified. 

Would it be something like this: 

select * from word where word = "qsda"; 

THat doesn't work by the way :) 

Any help would be great. 

Thanks 



--
View this message in context: 
http://www.nabble.com/Need-sql-query-help-t1844399.html#a5034347
Sent from the SQLite forum at Nabble.com.


Try "LIKE"


Re: [sqlite] Need sql query help

2006-06-25 Thread Ulrik Petersen

onemind wrote:

Thanks,

The thing is, i am going to need to use different letters each time to
search through over 200,000 words in a database and it needs to be fast.

What technology would be best suited for this task? I just assumed that a
databse would be ideal, why do you say sql isn't suited for this and what
is?

Thanks again.


Derrel Lipman's recent post may answer your question better, but here's 
a sketch of a solution that involves SQLite.



1) Find a suitable regular expression library, say, PCRE (Perl 
Compatible Regular Expressions) -- www.prce.org


2) Write a C function to be used from within SQLite, using the 
instructions found at:


http://www.sqlite.org/capi3ref.html#sqlite3_create_function

The C function might be a custom one that, given a string of letters, 
searched for all letters (AND) or any letters (OR), possibly using the 
RegEx library.


3) Recompile SQLite with said regex library added into the SQLite code, 
as well as your C function.


4) Register your C function with SQLite using the above API

5) Use the function with the regex '[spqd]' to search for words 
containing the letters "s", "p", "q", OR "d".  Doing it for all letters 
(AND) may be doable with a single regex, but if not, you can always, in 
your custom function, search for all the letters, mark them off one by 
one as you find them, and return the appropriate value when all have 
been found, otherwise, if you get to the end of the string, then return 
another appropriate value.


Another poster mentioned that you should really test the 
straightforward, simple-minded approach that he mentioned, first.  If it 
is fast enough, then why bother doing it the hard way.  The above 
probably also won't use an index, so it is also an O(n) approach, like 
the simple-minded approach of doing several LIKE's probably is.


200,000 words does not sound like a whole lot.  The first query might be 
a little slow, but if your table fits in memory, then your operating 
system's cache will probably make subsequent queries rather fast.


Having said all this, the fastest way would probably be to use an 
in-memory datastructure, and simply query that in-memory.  One possible 
-- and very simple -- solution would be to have a hash-map for every 
character you wished to be able to search, then store pointers to the 
strings of the words in each hash-map.  That would make your 
lookup-times be O(m), where m is the number of letters to search for, 
rather than O(n), where n is the number of words.



HTH

Ulrik Petersen



Re: [sqlite] Need sql query help

2006-06-25 Thread Derrell . Lipman
onemind <[EMAIL PROTECTED]> writes:

> What technology would be best suited for this task? I just assumed that a
> databse would be ideal, why do you say sql isn't suited for this and what
> is?

Take a look at the Controllable Regex Mutilator, CRM114,
.  It has mechanisms for detecting missing
letters, extra letters, changed letters, etc.  You'll likely find what you're
looking for there or at least get some good ideas from it.  (One of its
purposes is a spam filter, and it's the best at that of anything I've found!)

Derrell


Re: [sqlite] Need sql query help

2006-06-25 Thread onemind

Thanks,

The thing is, i am going to need to use different letters each time to
search through over 200,000 words in a database and it needs to be fast.

What technology would be best suited for this task? I just assumed that a
databse would be ideal, why do you say sql isn't suited for this and what
is?

Thanks again.
--
View this message in context: 
http://www.nabble.com/Need-sql-query-help-t1844399.html#a5034782
Sent from the SQLite forum at Nabble.com.



Re: [sqlite] Need sql query help

2006-06-25 Thread A. Pagaltzis
* onemind <[EMAIL PROTECTED]> [2006-06-25 16:05]:
> If i had a table wit a word column that had a huge list of
> words and i wanted to select every word that contained all
> these letters "qdsa". 

SELECT *
FROM words
WHERE
word LIKE '%q%'
AND word LIKE '%d%'
AND word LIKE '%s%'
AND word LIKE '%a%'

And that’s going to be slow like molasses. It’s not something SQL
is well suited to.

If you need to do this a lot, I suggest precomputing the kinds of
facts about each word that you’ll want to query and storing them
in a column or dependent table so you can create indices and
query them quickly.

Of course if the performance of the simpleminded approach is
sufficient for you, then all the better.

Regards,
-- 
Aristotle Pagaltzis // 


[sqlite] Need sql query help

2006-06-25 Thread onemind

Hi, 
I was hoping someone could tell me if it was possible to select all words
containing ceratin letters. 

Eg 

If i had a table wit a word column that had a huge list of words and i
wanted to select every word that contained all these letters "qdsa". 

Then it would return the words: 

quads 
quidas 

ect 

but wouldn't return 

queen 

because queen does not contain all letters specified. 

Would it be something like this: 

select * from word where word = "qsda"; 

THat doesn't work by the way :) 

Any help would be great. 

Thanks 


--
View this message in context: 
http://www.nabble.com/Need-sql-query-help-t1844399.html#a5034347
Sent from the SQLite forum at Nabble.com.