Thanks to everybody who helped with ideas and suggestions on improving the 
performance of a query. We've implemented and tested most of the suggestions 
made and they have produced a startling improvement.

Here's a table of the tests we made. We doubled the number of files we used 
from 10 to 20 to try to make sure we get a representative cross-section of 
files, so bear this in mind when looking at the old figures: basically you 
need to double them. We also did not go back to the same database each time, 
as copying and recopying a 16GB file is boring ;)

HTML tables look like junk on a mailing list, so this is plain text.

1 - Initial Baseline Perl - 213 secs
2 - Adding COLLATE NOCASE to database table RAG and NOT changing the query - 
Run 1 - 112 secs
3 - Rebuild RAG table without COLLATE NOCASE to check that the performance 
increase was really working - 234 secs
4 - Adding COLLATE NOCASE to database table RAG and NOT changing the query - 
Run 2 - 99 secs
5 - Rerun performance check again to make sure. No changes to RAG table, 
COLLATE NOCASE still there, Run 3 - 98 secs
6 - Rebuild database using .dump and cat the file back into a new DB - Run 1 - 
100 secs - (Odd spike for one of the files which took 11 secs)
7 - Rerun test with existing database to check spike - Run 2 - 90 secs
8 - rerun test yet again just to be sure - Run 3 - 90 secs
9 - Vacuum test not run as rebuild using .dump and cat done instead
10 - Remove all queries to __unknown__. Simple optimisation in Perl - No other 
optimisations such as COLLATE NOCASE - 153 secs - This run should be compared 
to "baseline Perl" and not to the other optimisations
11 - Remove all queries to __unknown__. Simple optimisation in Perl, all other 
optimisations in place - 90 secs - The Perl 'optimisation' has now been 
optimised out by all the other COLLATE changes. This is odd, we thought this 
would speed things up again - Needs further investigation.
12 - Remove Perl optimisation and check it is still OK - 91 secs
13 - Added ORDER BY Text COLLATE NOCASE ASC to query - 92 secs

The headline figure is that we have gone from 213 secs to process 20 files 
down to 90 secs, roughly 2.4x faster. The biggest single change was simply 
adding COLLATE NOCASE to the table schema, which on its own saved well over 
100 secs, which was brilliant.

CREATE TABLE "RAG" (
         "Id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
         "Count" integer NOT NULL DEFAULT 0,
         "Text" TEXT COLLATE NOCASE ,
         "Peak" integer,
         "Calculation" integer NOT NULL DEFAULT 0,
         "Red" integer DEFAULT 0,
         "Amber" integer DEFAULT 0,
         "Green" integer DEFAULT 0,
         "BayesAttributes" TEXT
);
INSERT INTO "main".sqlite_sequence (name, seq) VALUES ("RAG", '43330');

-- ----------------------------
--  Indexes structure for table RAG
-- ----------------------------
CREATE UNIQUE INDEX "RAG_Idx1" ON RAG ("Text" COLLATE NOCASE ASC);
CREATE UNIQUE INDEX "RAG_Idx2" ON RAG ("Calculation" COLLATE NOCASE ASC, "Text" 
COLLATE NOCASE ASC);
CREATE INDEX "Rag_Idx3" ON RAG ("Calculation" COLLATE NOCASE ASC, 
"BayesAttributes" COLLATE NOCASE ASC);

We also deleted one of the indexes:

> CREATE UNIQUE INDEX "Rag_Idx4" ON RAG ("Id" COLLATE NOCASE ASC, "Calculation" 
> COLLATE NOCASE ASC, "Peak" COLLATE NOCASE ASC, "Red" COLLATE NOCASE ASC, 
> "Amber" COLLATE NOCASE ASC, "Green" COLLATE NOCASE ASC, "Text" COLLATE NOCASE 
> ASC);

Index Rag_Idx4 was not in the database version we copied from to start our 
testing. Our bad, we were too quick to start testing. However we checked using 
the query plan and didn't see any issues, so we have left it out.
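
For anyone wanting to repeat the check, this is just an EXPLAIN QUERY PLAN over 
the SELECT from the original post (the literal text value is only an example 
from the feed). With the COLLATE NOCASE column and RAG_Idx1 in place the plan 
should be a SEARCH rather than a SCAN, something like:

EXPLAIN QUERY PLAN
select Id,Calculation,Peak,Red,Amber,Green from RAG
where Text = 'expect delays.';
-- expect output along the lines of:
--   SEARCH TABLE RAG USING INDEX RAG_Idx1 (Text=?)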

The second biggest improvement was the 10 secs saved by dumping the database 
and rebuilding it from the dump file. We might have seen a bigger saving from 
this step if we had done it first, with COLLATE NOCASE then probably showing a 
correspondingly smaller one :)
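
The dump-and-rebuild step itself is just the following (filenames are 
illustrative):

sqlite3 tfl.sqlite .dump > tfl.dump
cat tfl.dump | sqlite3 tfl_rebuilt.sqlite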

We did add a Perl logic optimisation which made a massive difference of 60-70 
secs when run against the baseline, but when we added the same optimisation 
back in on top of the COLLATE NOCASE database it made no difference. This 
puzzles us, as it removes a lot of the SELECT lookups and so should still make 
a difference at 90 secs. We'll investigate further.
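
For context, the optimisation simply short-circuits the lookup for the 
__unknown__ placeholder before it ever reaches SQLite. A minimal Perl sketch of 
the idea (the handle, statement and sub names here are illustrative, not our 
actual code):

use strict;
use warnings;
use DBI;

# Illustrative names only - the real program's structure differs.
my $dbh = DBI->connect("dbi:SQLite:dbname=tfl.sqlite", "", "",
                       { RaiseError => 1 });
my $sth = $dbh->prepare(
    "select Id,Calculation,Peak,Red,Amber,Green from RAG where Text = ?");

sub rag_lookup {
    my ($text) = @_;

    # Skip the round trip to SQLite entirely for the placeholder value.
    return undef if $text eq '__unknown__';

    $sth->execute($text);
    return $sth->fetchrow_hashref;
}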

Adding ORDER BY Text COLLATE NOCASE ASC to the query made no difference, as we 
would expect now that the indexes have been updated.
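
(For completeness, the query tested looked like this, reconstructed from the 
SELECT in the original post with the clause appended:)

select Id,Calculation,Peak,Red,Amber,Green from RAG
where Text = 'expect delays.'
ORDER BY Text COLLATE NOCASE ASC;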

We also recreated the test where we ran direct SELECT statements into sqlite3:

time cat ,output | sqlite3 tfl.sqlite > /dev/null

real    0m0.247s
user    0m0.152s
sys     0m0.096s

This was probably the most revealing test of all. It took essentially no time 
to run 10,551 selects, which implies that the database previously had issues 
with either fragmentation or poorly set up indexes.

When we look at the graphical output from the performance profiler we find 
that the bottleneck has moved: another area now takes up the most time, which 
is exactly what we would expect, as something has to be the slowest part of 
the system. So long as we don't end up optimising the idle loop 
(http://c2.com/cgi/wiki?OptimizingTheIdleLoop) :)

We are delighted with the results and would have taken a 30% decrease as great 
news; a 250% decrease (I know, I know, you can't take 250% off something) is 
fantastic. We now need to apply this to our almost-production database and add 
in the dumping and rebuilding on a monthly basis.

Hopefully this may help other people with tuning their databases.

We'd like to say thanks to all the people who offered help and suggestions:

Simon, Ward, R Smith, Gerd, Clemens

We really appreciate it and if you ever get to London the beers and pizzas are 
on us. 

All the best,

Rob

> On 2 Jul 2015, at 23:16, Rob Willett <rob.sqlite at robertwillett.com> wrote:
> 
> Hi,
> 
> We're trying to understand whether or not we have a performance problem with 
> our SQLite database. It may well be that we are 'optimal' for some value of 
> optimal, or it may be that we can improve our system. I was hoping to use the 
> experience here to help focus our attention or give some guidance based on 
> real world usage.
> 
> It may well be that we don't have a problem and our performance is 
> appropriate. That's also a good result as we can stop worrying about things :)
> 
> This e-mail has quite a lot of information in it as we want to give as much 
> information as appropriate. We don't want to swamp you but we can't simplify 
> it too much.
> 
> Background
> 
> Our database takes a five minute feed of traffic information in London. This 
> data is XML-based and has approximately 500-700 items in it. These items 
> cover everything from a new traffic jam through to three year old roadworks. 
> Yes, there are roadworks in London that go back three years!
> 
> We process this XML feed and pull out chunks of the XML, process it and 
> update our database with it. This database is currently 16GB in size. Our 
> concern is the time taken to process each file every five minutes. It has 
> been getting steadily longer and longer. It started out at around 6 seconds 
> and is now around 22-24 seconds. Whilst we expect processing time to get 
> bigger, we are getting concerned about the performance and decided to have a 
> look.
> 
> Our XML processing system is written in Perl, which has advantages and 
> disadvantages. Probably the biggest advantage is that it's forgiving for 
> development and allows us to move quickly when processing significantly large 
> amounts of text within the XML. The biggest disadvantage to us is the lack 
> of multithreading, so it's limited by the clock speed of the CPU. The database 
> (DB) we use is only ever read and written by a single process; we do not 
> network the database or allow any access apart from processing the XML file 
> and producing an output JSON file.
> 
> As our processing time has got longer, we decided to have a look at where our 
> Perl program was spending most time. For the interested we use the Perl 
> module Devel::NYTProf. This is a well known and well respected performance 
> profiling tool. 
> 
> http://search.cpan.org/~timb/Devel-NYTProf-6.01/lib/Devel/NYTProf.pm
> 
> We do not need to adapt our Perl code to use it; you simply invoke Perl with 
> -d:NYTProf <program_name>.
> 
> The output is a nice HTML web page that allows you to drill down into the 
> sections of the code that take up the most time. We ran the program over ten 
> iterations of files, to even things out, on a 16GB SQLite database and looked 
> at the output. We were very surprised to see where most of the time was 
> spent: out of a two minute run, 70% of the time was spent in an innocuous 
> function. This was odd!
> 
> We drilled down even further and found that a single select call was 
> occupying 90% of that 70%. The performance profiler was indicating that a 
> select on a table was taking around 13ms/call and we were doing around 5,000 
> calls in our ten file run. The approx 5,000 calls was about right as there 
> are around 500 distinct chunks of information per file. So the numbers seemed 
> to add up. The issue for us was the 13ms per call.
> 
> The actual SQL called 5,000 times is 
> 
> "select Id,Calculation,Peak,Red,Amber,Green from RAG where text = ?"
> 
> RAG is a Red/Amber/Green table and the text is a simple sentence that 
> describes road conditions. See below for examples.
> 
> The RAG table schema is 
> 
> CREATE TABLE "RAG" (
>        "Id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
>        "Count" integer NOT NULL DEFAULT 0,
>        "Text" TEXT,
>        "Peak" integer,
>        "Calculation" integer NOT NULL DEFAULT 0,
>        "Red" integer DEFAULT 0,
>        "Amber" integer DEFAULT 0,
>        "Green" integer DEFAULT 0,
>        "BayesAttributes" TEXT
> );
> 
> It has four indexes on it 
> 
> CREATE UNIQUE INDEX "RAG_Idx1" ON RAG ("Text" COLLATE NOCASE ASC);
> CREATE UNIQUE INDEX "RAG_Idx2" ON RAG ("Calculation" COLLATE NOCASE ASC, 
> "Text" COLLATE NOCASE ASC);
> CREATE INDEX "Rag_Idx3" ON RAG ("Calculation" COLLATE NOCASE ASC, 
> "BayesAttributes" COLLATE NOCASE ASC);
> CREATE UNIQUE INDEX "Rag_Idx4" ON RAG ("Id" COLLATE NOCASE ASC, "Calculation" 
> COLLATE NOCASE ASC, "Peak" COLLATE NOCASE ASC, "Red" COLLATE NOCASE ASC, 
> "Amber" COLLATE NOCASE ASC, "Green" COLLATE NOCASE ASC, "Text" COLLATE NOCASE 
> ASC);
> 
> The table has approximately 43,000 rows in it and doesn't grow very much now. 
> We don't have complex keys linking things together, we have a simple table.
> 
> Our first thought was that the Perl program was in the way.
> 
> We made sure we prepared statements in advance, we analysed the indexes and 
> rebuilt them, we checked the query plan again and did the normal simple 
> housekeeping.
> 
> The query plan showed 
> 
> SCAN TABLE RAG USING COVERING INDEX Rag_Idx4
> 
> which we think is OK.
> 
> To try and isolate the problem away from the Perl program, we then generated 
> the 5,000 SQL calls that would be made by the Perl program into a file, e.g.:
> 
> select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is 
> moving well on diversion.';
> select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'approach 
> with caution.';
> select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is 
> currently flowing well in the area.';
> select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is 
> slow moving past the restrictions.';
> select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is 
> slow moving on approach to closures in both directions.';
> select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'expect 
> delays.';
> 
> … similar but slightly different SQL for the next 5,000 lines.
> 
> We then piped that straight into sqlite and timed the output.
> 
> time cat ,test1 | sqlite3 db.sqlite > /dev/null
> 
> real  0m49.610s
> user  0m31.328s
> sys   0m18.272s
> 
> This took around 50 seconds, which is faster than the Perl program (122 
> seconds). Since the Perl program has to read in 10 x 5MB XML files, build up 
> XML structures, walk the XML, extract lines from the XML and do lots of other 
> programming stuff, this doesn't seem unreasonable.
> 
> Whilst piping a large amount of SQL into sqlite3 doesn't feel quite right, 
> we're struggling with how else we can benchmark this to see if we have a 
> problem or not. All suggestions welcomed.
> 
> So we have 5,000 selects on a simple database table that we *think* is 
> indexed correctly, which shows that each select takes around 10ms in raw 
> form and 13ms through Perl (we accept that the additional ~3ms overhead is 
> Perl). This gives us a select speed of approx 80-100 selects per second. 
> We need to look more into the split of real/user/sys time.
> 
> The SQLite speed page (https://www.sqlite.org/speed.html), Test 7, shows that 
> SQLite 2.7.6 can do 5,000 selects per second with an index, which does seem 
> to indicate that our 80-100 selects/second is a bit slow. Now, Test 7 on that 
> page uses integers and we are using strings, but we are only testing for 
> equality so we hope we could be quite quick.
> 
> Our test system is a high end Mac Pro running a 4790K CPU at 4.3GHz with an 
> enterprise level SSD (a Samsung 850 512GB), so that's rather a lot more 
> grunt than the 1.6GHz Athlon quoted in the test spec :)
> 
> We cannot easily make the system multi-threaded. We have thought of a way to 
> move some of the SQL logic into Perl to speed things up, but it's a kludge, a 
> hack, and that's the nicest thing I can say about it.
> 
> Basically we can look at the command "cat ,test1 | sqlite3 db.sqlite > 
> /dev/null" and ask: is running 5,000 select statements in 50 seconds 
> on the above table with the above indexes performant or not? My feeling is 
> not, but I don't have the experience to be certain. If not, what's slowing it 
> down?
> 
> Thanks for reading this far. I didn't want to put a simple "my DB is slow" 
> type of question out without showing the work we have done to try to 
> understand the issue and give you some background. I apologise if I have 
> provided too much.
> 
> Thanks,
> 
> Rob
