Thanks for the reply

> I can tell you that the SQLite3 command line program uses threading and is 
> extremely fast and optimised, it is no surprise that the query takes much 
> longer in many other programs using a library or even compiled-in SQLite 
> code. If it takes 50 seconds piped into SQLite then 100+ seconds in Perl is 
> not completely weird.
> 
> You say that the SELECT on the Table is taking 13ms/call, but I assume you 
> mean the SELECT statement takes 13ms to run from start to finish and copy all 
> the selected DB content into Perl text arrays... this may be a bottleneck. 
> How many lines (records) are returned by the queries on average? For 10K+ 
> lines, 13ms would be a fantastic result - but on a 43K row table I doubt that 
> very much.

I have no doubt that sqlite3 is fast and optimised; that's why we like it.

My earlier comments were about the speed difference: roughly 50 seconds for doing the 5,000 selects 
on the database directly vs 120 seconds for doing ALL the Perl work, which 
includes those 5,000 individual SELECTs. My assumption was that sqlite3 is very 
good at its job and that using sqlite3 directly gives as good a performance as we 
can get. So that's my baseline to work from.

Each select will return either one row or no rows. The text column is unique 
across the table. 
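
For context, the shape of what we are timing is roughly the following. The table, 
column and value names here are made up purely for illustration, not our real schema:

    -- Hypothetical shape of the lookup: a unique text column with a
    -- NOCASE-collated index, and a single-row SELECT against it.
    CREATE TABLE rag_phrases (
        id         INTEGER PRIMARY KEY,
        phrase     TEXT NOT NULL,
        rag_status TEXT NOT NULL        -- 'Red', 'Amber' or 'Green'
    );
    CREATE UNIQUE INDEX rag_phrases_phrase_idx
        ON rag_phrases (phrase COLLATE NOCASE);

    -- Each lookup returns one row or none. The COLLATE clause matches
    -- the collation of the index so SQLite can use it for the search.
    SELECT rag_status
      FROM rag_phrases
     WHERE phrase = 'Traffic is normal' COLLATE NOCASE;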

We see the difference between 10ms and 13ms as acceptable in relative terms. If Perl 
adds 3ms to the select query we can understand that, as that's the overhead of 
using Perl as the database manager. What we are trying to understand, and I 
apologise for stating it inelegantly or not being clear, is whether 10ms is 
acceptable performance to pull a single line out of a database using a select 
on a text field in a table that has an index on it. 

If 10ms is the very best sqlite3 can do using nothing but a select statement run 
directly against the database with the sqlite3 tool, with no Perl overhead, then we are 
comfortable with the 13ms it takes in Perl. We understand the 3ms extra that 
Perl adds; we're trying to see whether the 10ms itself is acceptable. That only works out 
at 100 selects per second, but the test pages we have seen suggest we should 
get far, far faster. 
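
A direct check in the sqlite3 command line shell would look roughly like this, using 
the placeholder names above; .timer gives per-statement timings and EXPLAIN QUERY PLAN 
shows whether the index is actually being used:

    .timer on
    EXPLAIN QUERY PLAN
    SELECT rag_status
      FROM rag_phrases
     WHERE phrase = 'Traffic is normal' COLLATE NOCASE;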

> 
> The most immediate optimisation I can see is making the column collation 
> NOCASE (as opposed to the Index). This will avoid running the collation 
> function on every item. SQLite may actually optimise this already and I don't 
> know if it does, but it's worth testing.

We will try this and report back.
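
If we have understood the suggestion correctly, it amounts to something like the 
following, again with made-up names; the collation moves from the index onto the 
column itself, so plain equality comparisons use NOCASE without needing a COLLATE 
clause in every query:

    CREATE TABLE rag_phrases (
        id         INTEGER PRIMARY KEY,
        phrase     TEXT NOT NULL COLLATE NOCASE,
        rag_status TEXT NOT NULL
    );
    -- The index inherits the column's NOCASE collation.
    CREATE UNIQUE INDEX rag_phrases_phrase_idx ON rag_phrases (phrase);

    SELECT rag_status FROM rag_phrases WHERE phrase = 'Traffic is normal';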

> Another thing I would do is add a column which uses tokens for strings, as it 
> seems these are standard repeating strings which may be reduced to a simple 
> Integer, which will be much faster to look up than long string comparisons.

Sadly the text column does not hold standard repeating strings. The aim of the table 
is to hold the myriad ways that Transport for London can say similar 
things, e.g. they will state "Traffic is normal for the time of day", "Traffic 
is normal", "Normal Traffic", "Traffic is normal this morning", "trafic is slow" 
(the spelling mistake is deliberate) and so on and so on. 

Add in all the various ways you can describe lane closures, traffic jams, 
traffic lights being out, bridge closures and openings. A person writes the 
update for each disruption and they often do use the same words, but they 
misspell something, they change things round, they add new stuff in. 

The RAG table is a way to "normalise" that information so that we have a 
database of what has been said, what we classified it as (Red/Amber/Green) and how 
we calculated that RAG status. We also shorten and stem the words and do 
Bayesian analysis on new versions of phrases to try and work out what is meant. 
It's a key table for us and a lot of pre-processing work goes into creating it. 

We wish there was a standard set of phrases but there isn't. The 43K lines 
demonstrate it; we are constantly surprised by the new ways people find to 
describe exactly the same situation :) We had looked at your idea previously 
and assumed that we could use a similar approach, but the wide range of text we 
found stopped that working. Using natural language stemming and Bayesian 
analysis got us closer, but it's still imperfect.

> If it isn't part of the usual maintenance, try re-pack the DB file using 
> VACUUM.

We'll check to see if we do that and, if not, we'll try it and report back.
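
If it turns out we don't already run it, it is just a one-liner against the database 
file (placeholder filename):

    sqlite3 traffic.db "VACUUM;"
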
> 
> In normal Code we would of course compute and re-use the prepared statements 
> shaving off another significant bit of time, but not sure if the Perl 
> interface allows for this or if it will reduce significant time for those 
> straight-forward queries.

We already use prepared statements wherever possible. It's pretty much standard 
coding practice for us. We also tried removing the prepared statement for this 
particular query and found it made little difference, which surprised us. 
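
For completeness, the pattern we mean is roughly the following sketch with DBI and 
DBD::SQLite, using the same placeholder names as above rather than our actual code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Placeholder database file and table names.
    my $dbh = DBI->connect("dbi:SQLite:dbname=traffic.db", "", "",
                           { RaiseError => 1, AutoCommit => 1 });

    # Prepare once, execute many times.
    my $sth = $dbh->prepare(
        "SELECT rag_status FROM rag_phrases WHERE phrase = ?"
    );

    # In practice this would be the ~5,000 phrases we look up.
    my @phrases = ('Traffic is normal', 'Normal Traffic');

    for my $phrase (@phrases) {
        $sth->execute($phrase);
        my ($rag) = $sth->fetchrow_array;   # one row or undef
        $sth->finish;
        # ... use $rag ...
    }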

> Looking at your system specs, it's hard to imagine your results NOT being 
> significantly faster than the quoted test on the SQLite site (Since then, 
> even SQLite itself has undergone many significant performance improvements). 
> If the above still doesn't make it much faster I'd start checking for 
> hardware bottlenecks or other settings that might influence the speed.

We have done these tests on a number of servers. Our "production" servers, 
though we are not yet live, show the same overall slowdown, and that's a clean 
build running Linux rather than Mac OS X, which is our development environment. 
The fastest box we have is the Mac so we use that to check things out. 

We'll spend the morning here in London going through the various changes and 
logging what we are doing to see the effects on the database. Other people have 
replied and we're trying to put a plan together for the day. 

Thanks for taking the time to help, it's much appreciated,

Rob
