Hi,

We're trying to understand whether or not we have a performance problem with our SQLite database. It may well be that we are "optimal" for some value of optimal, or it may be that we can improve our system. I was hoping to draw on the experience here to help focus our attention, or to get some guidance based on real-world usage.
It may well be that we don't have a problem and our performance is appropriate. That's also a good result, as we can stop worrying about things :) This e-mail has quite a lot of information in it, as we want to give as much detail as is appropriate. We don't want to swamp you, but we can't simplify it too much.

Background

Our database takes a five-minute feed of traffic information in London. This data is XML-based and has approximately 500-700 items in it. These items cover everything from a new traffic jam through to three-year-old roadworks. Yes, there are roadworks in London that go back three years! We process this XML feed, pull out chunks of the XML, process them, and update our database with the results. This database is currently 16GB in size.

Our concern is the time taken to process each file every five minutes. It has been getting steadily longer: it started out at around 6 seconds and is now around 22-24 seconds. Whilst we expect processing time to grow, we are getting concerned about the performance and decided to take a look.

Our XML processing system is written in Perl, which has advantages and disadvantages. Probably the biggest advantage is that it is forgiving for development and lets us move quickly when processing significantly large amounts of text within the XML. The biggest disadvantage for us is the lack of multithreading, so we are limited by the clock speed of the CPU. The database (DB) is only ever read and written by a single process; we do not network the database or allow any access apart from processing the XML file and producing an output JSON file.

As our processing time got longer, we decided to look at where our Perl program was spending most of its time. For the interested, we use the Perl module Devel::NYTProf, a well known and well respected performance profiling tool:

http://search.cpan.org/~timb/Devel-NYTProf-6.01/lib/Devel/NYTProf.pm

We do not need to adapt our Perl code to use it; you simply invoke Perl with -d:NYTProf <program_name>. The output is a nice HTML page that allows you to drill down into the sections of the code that take up the most time.

We ran the program over ten iterations of files, to even things out, against the 16GB SQLite database and looked at the output. We were very surprised to see where most of the time was spent: out of a two-minute run, 70% of the time was spent in an innocuous function. This was odd! We drilled down further and found that a single select call was occupying 90% of that 70%. The profiler indicated that a select on one table was taking around 13ms/call, and we were doing around 5,000 calls in our ten-file run. The approximately 5,000 calls is about right, as there are around 500 distinct chunks of information per file, so the numbers seemed to add up. The issue for us is the 13ms per call. The actual SQL, called 5,000 times, is

select Id,Calculation,Peak,Red,Amber,Green from RAG where text = ?

RAG is a Red/Amber/Green table, and Text is a simple sentence that describes road conditions. See below for examples.
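For concreteness, here is a minimal sketch of how that query is driven, assuming DBI with the SQLite driver; the sample sentences are taken from the examples further down, and in the real program they come from walking the XML:

use strict;
use warnings;
use DBI;

# Connect once at start-up; the database is only ever touched by this process.
my $dbh = DBI->connect("dbi:SQLite:dbname=db.sqlite", "", "",
                       { RaiseError => 1 });

# Prepare once, outside the per-chunk loop.
my $sth = $dbh->prepare(
    "select Id,Calculation,Peak,Red,Amber,Green from RAG where text = ?");

# Hypothetical: the real program extracts these sentences from the XML feed.
my @sentences = ('expect delays.', 'approach with caution.');

for my $sentence (@sentences) {
    $sth->execute($sentence);              # bind the sentence as a parameter
    if (my $row = $sth->fetchrow_hashref) {
        # ... use $row->{Id}, $row->{Red}, $row->{Amber}, $row->{Green} ...
    }
    $sth->finish;
}

The statement is prepared once and the sentence is bound on each call, which matches what the real program does (see below).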
The RAG table schema is:

CREATE TABLE "RAG" (
    "Id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
    "Count" integer NOT NULL DEFAULT 0,
    "Text" TEXT,
    "Peak" integer,
    "Calculation" integer NOT NULL DEFAULT 0,
    "Red" integer DEFAULT 0,
    "Amber" integer DEFAULT 0,
    "Green" integer DEFAULT 0,
    "BayesAttributes" TEXT
);

It has four indexes on it:

CREATE UNIQUE INDEX "RAG_Idx1" ON RAG ("Text" COLLATE NOCASE ASC);
CREATE UNIQUE INDEX "RAG_Idx2" ON RAG ("Calculation" COLLATE NOCASE ASC, "Text" COLLATE NOCASE ASC);
CREATE INDEX "Rag_Idx3" ON RAG ("Calculation" COLLATE NOCASE ASC, "BayesAttributes" COLLATE NOCASE ASC);
CREATE UNIQUE INDEX "Rag_Idx4" ON RAG ("Id" COLLATE NOCASE ASC, "Calculation" COLLATE NOCASE ASC, "Peak" COLLATE NOCASE ASC, "Red" COLLATE NOCASE ASC, "Amber" COLLATE NOCASE ASC, "Green" COLLATE NOCASE ASC, "Text" COLLATE NOCASE ASC);

The table has approximately 43,000 rows in it and doesn't grow much any more. We don't have complex keys linking things together; it is a simple table.

Our first thought was that the Perl program was in the way. We made sure we prepared statements in advance, we analysed and rebuilt the indexes, we checked the query plan again, and we did the other normal, simple housekeeping. The query plan showed

SCAN TABLE RAG USING COVERING INDEX Rag_Idx4

which we think is OK.

To isolate the problem from the Perl program, we then generated into a file the 5,000 SQL calls that the Perl program would make, e.g.

select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is moving well on diversion.';
select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'approach with caution.';
select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is currently flowing well in the area.';
select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is slow moving past the restrictions.';
select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is slow moving on approach to closures in both directions.';
select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'expect delays.';
... similar but slightly different SQL for the next 5,000 lines.

We then piped that straight into sqlite3 and timed it:

time cat ,test1 | sqlite3 db.sqlite > /dev/null

real    0m49.610s
user    0m31.328s
sys     0m18.272s

This took around 50 seconds, which is faster than the Perl program (122 seconds). Since the Perl program has to read in ten 5MB XML files, build up XML structures, walk the XML, extract lines from it, and do lots of other programming work, that does not seem unreasonable. Whilst piping a large amount of SQL into sqlite3 doesn't feel quite right, we're struggling with how else to benchmark this to see whether we have a problem or not. All suggestions are welcome; one alternative we could try is sketched below.
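One way to take both the shell and the XML handling out of the measurement would be to time just the execute/fetch cycle from Perl. A rough sketch, assuming DBI and Time::HiRes; the ,sentences input file (one query sentence per line) and the loading code are hypothetical:

use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $dbh = DBI->connect("dbi:SQLite:dbname=db.sqlite", "", "",
                       { RaiseError => 1 });
my $sth = $dbh->prepare(
    "select Id,Calculation,Peak,Red,Amber,Green from RAG where text = ?");

# Hypothetical input: the ~5,000 sentences extracted from ten feed files.
open my $fh, '<', ',sentences' or die "cannot open ,sentences: $!";
chomp(my @sentences = <$fh>);
close $fh;

# Time only the database work.
my $t0 = [gettimeofday];
for my $sentence (@sentences) {
    $sth->execute($sentence);
    1 while defined $sth->fetchrow_arrayref;    # drain the result set
}
my $elapsed = tv_interval($t0);

printf "%d selects in %.2fs (%.2f ms/select)\n",
    scalar(@sentences), $elapsed, 1000 * $elapsed / @sentences;

That would give a per-select figure with only DBI between us and SQLite, which we could compare against the roughly 10ms we see from the shell.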
So we have 5,000 selects on a simple database table that we *think* is indexed correctly, and each select takes around 10ms in raw form and 13ms through Perl (we accept that the additional ~3ms per call is Perl overhead). This gives us a select speed of approximately 80-100 selects per second. We still need to look more closely at the split of real/user/sys time.

The SQLite speed page, https://www.sqlite.org/speed.html, shows in Test 7 that SQLite 2.7.6 can do 5,000 selects per second with an index, which does seem to indicate that our 80-100 selects/second is a bit slow. Test 7 uses integers where we are using strings, but we are only testing for equality, so we would hope to be quite quick.

Our test system is a high-end Mac Pro running a 4790K CPU at 4.3GHz with an enterprise-level SSD, a Samsung 850 512GB, so that is rather a lot more grunt than the 1.6GHz Athlon quoted in the test spec :)

We cannot easily make the system multi-threaded. We have thought of a way to move some of the SQL logic into Perl to speed things up, but it's a kludge, a hack, and that's the nicest thing I can say about it.

Basically, we can look at the command "cat ,test1 | sqlite3 db.sqlite > /dev/null" and ask: is running 5,000 select statements in 50 seconds, against the above table with the above indexes, performant or not? My feeling is that it is not, but I don't have the experience to be certain. If it is not, what is slowing it down?

Thanks for reading this far. I didn't want to send a simple "My DB is slow" type of question without showing what we have done to try to understand the issue and without giving you some background. I apologise if I have provided too much.

Thanks,

Rob