Hi,

We're trying to understand whether we have a performance problem with 
our SQLite database. It may well be that we are "optimal" for some value of 
optimal, or it may be that we can improve our system. I was hoping to draw on 
the experience here to help focus our attention, or to get some guidance based 
on real-world usage.

It may well be that we don't have a problem and our performance is appropriate. 
That's also a good result, as we can stop worrying about things :)

This e-mail contains quite a lot of information, as we want to give you as 
much detail as is appropriate. We don't want to swamp you, but we can't 
simplify it too much.

Background

Our database takes a five-minute feed of traffic information in London. This 
data is XML-based and has approximately 500-700 items in it. These items cover 
everything from a new traffic jam through to three-year-old roadworks. Yes, 
there are roadworks in London that go back three years!

We process this XML feed, pull chunks out of it, and update our database with 
them. This database is currently 16GB in size. Our concern is the time taken 
to process each file every five minutes, which has been getting steadily 
longer. It started out at around 6 seconds and is now around 22-24 seconds. 
Whilst we expect processing time to grow, we are getting concerned about the 
performance and decided to have a look.

Our XML processing system is written in Perl, which has advantages and 
disadvantages. Probably the biggest advantage is that it is forgiving for 
development and lets us move quickly when processing significantly large 
amounts of text within the XML. The biggest disadvantage for us is the lack of 
multithreading, so we are limited by the clock speed of a single CPU core. The 
database (DB) is only ever read and written by a single process; we do not 
network the database or allow any access apart from processing the XML file 
and producing an output JSON file.

As our processing time has grown longer, we decided to have a look at where 
our Perl program was spending most of its time. For the interested, we use the 
Perl module Devel::NYTProf, a well-known and well-respected performance 
profiling tool.

http://search.cpan.org/~timb/Devel-NYTProf-6.01/lib/Devel/NYTProf.pm

We do not need to adapt our Perl code to use it; you simply invoke Perl with 
-d:NYTProf <program_name>.
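
For reference, a typical run looks like this (the script name here is just 
illustrative; nytprofhtml ships with the module):

   # profile one run; the profiler writes ./nytprof.out
   perl -d:NYTProf process_feed.pl

   # turn the profile into drill-down HTML pages (written to ./nytprof/)
   nytprofhtml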

The output is a nice HTML page that lets you drill down into the sections of 
the code that take up the most time. We ran the program over ten iterations of 
files, to even things out, against a 16GB SQLite database and looked at the 
output. We were very surprised to see where most of the time was spent: out of 
a two-minute run, 70% of the time was spent in an innocuous function. This was 
odd!

We drilled down even further and found that a single select call was occupying 
90% of that 70%. The profiler indicated that a select on one table was taking 
around 13ms/call, and that we were making around 5,000 calls in our ten-file 
run. The approx 5,000 calls was about right, as there are around 500 distinct 
chunks of information per file, so the numbers added up: 5,000 calls x 13ms is 
roughly 65 seconds, which is in the same ballpark as 90% of 70% of the 
122-second run (about 77 seconds). The issue for us was the 13ms per call.

The actual SQL called 5,000 times is 

select Id,Calculation,Peak,Red,Amber,Green from RAG where text = ?

RAG is a Red/Amber/Green table, and Text is a simple sentence that describes 
road conditions. See below for examples.

The RAG table schema is 

CREATE TABLE "RAG" (
         "Id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
         "Count" integer NOT NULL DEFAULT 0,
         "Text" TEXT,
         "Peak" integer,
         "Calculation" integer NOT NULL DEFAULT 0,
         "Red" integer DEFAULT 0,
         "Amber" integer DEFAULT 0,
         "Green" integer DEFAULT 0,
         "BayesAttributes" TEXT
);

It has four indexes on it 

CREATE UNIQUE INDEX "RAG_Idx1" ON RAG ("Text" COLLATE NOCASE ASC);
CREATE UNIQUE INDEX "RAG_Idx2" ON RAG ("Calculation" COLLATE NOCASE ASC, "Text" 
COLLATE NOCASE ASC);
CREATE INDEX "Rag_Idx3" ON RAG ("Calculation" COLLATE NOCASE ASC, 
"BayesAttributes" COLLATE NOCASE ASC);
CREATE UNIQUE INDEX "Rag_Idx4" ON RAG ("Id" COLLATE NOCASE ASC, "Calculation" 
COLLATE NOCASE ASC, "Peak" COLLATE NOCASE ASC, "Red" COLLATE NOCASE ASC, 
"Amber" COLLATE NOCASE ASC, "Green" COLLATE NOCASE ASC, "Text" COLLATE NOCASE 
ASC);

The table has approximately 43,000 rows in it and doesn't grow very much now. 
We don't have complex keys linking things together; we have a simple table.

Our first thought was that the Perl program was in the way.

We made sure we prepared statements in advance, analysed and rebuilt the 
indexes, checked the query plan, and did the normal simple housekeeping.
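
For completeness, the hot path boils down to something like the sketch below. 
This is a paraphrase, not our production code, and the sentence list is a 
stand-in for what we pull out of the XML; the point is that the statement is 
prepared once and executed per item:

   use strict;
   use warnings;
   use DBI;

   my $dbh = DBI->connect("dbi:SQLite:dbname=db.sqlite", "", "",
                          { RaiseError => 1 });

   # prepared once, outside the per-item loop
   my $sth = $dbh->prepare(
       "select Id,Calculation,Peak,Red,Amber,Green from RAG where text = ?");

   my @sentences = ('expect delays.', 'approach with caution.');  # stand-in

   for my $sentence (@sentences) {
       $sth->execute($sentence);
       my $row = $sth->fetchrow_arrayref;   # at most one row; Text is unique
       # ... use $row ...
   }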

The query plan showed 

SCAN TABLE RAG USING COVERING INDEX Rag_Idx4

which we think is OK.
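
For anyone wanting to reproduce that, it is just EXPLAIN QUERY PLAN in the 
sqlite3 shell (the exact output format varies a little between SQLite 
versions):

   sqlite> explain query plan
      ...> select Id,Calculation,Peak,Red,Amber,Green from RAG
      ...> where text = 'expect delays.';
   0|0|0|SCAN TABLE RAG USING COVERING INDEX Rag_Idx4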

To try to isolate the problem from the Perl program, we then dumped the 5,000 
SQL statements that the Perl program would issue into a file (a sketch of the 
generator follows the examples), e.g.

select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is 
moving well on diversion.';
select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'approach with 
caution.';
select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is 
currently flowing well in the area.';
select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is 
slow moving past the restrictions.';
select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'traffic is 
slow moving on approach to closures in both directions.';
select Id,Calculation,Peak,Red,Amber,Green from RAG where text = 'expect 
delays.';

... similar but slightly different SQL for the next 5,000 lines.
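
The generator itself is trivial. A sketch, with sample stand-ins for the 
~5,000 sentences pulled from the XML, using DBI's quote() so embedded 
apostrophes don't break the file:

   use strict;
   use warnings;
   use DBI;

   my @sentences = ('expect delays.', 'approach with caution.');  # stand-in

   my $dbh = DBI->connect("dbi:SQLite:dbname=db.sqlite", "", "",
                          { RaiseError => 1 });

   open my $out, '>', ',test1' or die $!;
   for my $sentence (@sentences) {
       printf {$out} "select Id,Calculation,Peak,Red,Amber,Green "
                   . "from RAG where text = %s;\n", $dbh->quote($sentence);
   }
   close $out;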

We then piped that file straight into the sqlite3 shell and timed the run.

time cat ,test1 | sqlite3 db.sqlite > /dev/null

real    0m49.610s
user    0m31.328s
sys     0m18.272s

This took around 50 seconds, which is faster than the Perl program (122 
seconds). Since the Perl program has to read in ten 5MB XML files, build up 
XML structures, walk the XML, extract lines from it and do lots of other 
programming work, this doesn't seem unreasonable.

Whilst piping a large amount of SQL into sqlite3 doesn't feel quite right, 
we're struggling to see how else we can benchmark this to tell whether we have 
a problem or not. All suggestions welcomed.
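
One alternative we have sketched (untested; it assumes a companion file ,test2 
holding one bare sentence per line) is to time just the prepared-statement 
loop from Perl, which takes both the XML handling and the sqlite3 shell's SQL 
parsing out of the picture:

   use strict;
   use warnings;
   use DBI;
   use Time::HiRes qw(gettimeofday tv_interval);

   my $dbh = DBI->connect("dbi:SQLite:dbname=db.sqlite", "", "",
                          { RaiseError => 1 });
   my $sth = $dbh->prepare(
       "select Id,Calculation,Peak,Red,Amber,Green from RAG where text = ?");

   open my $fh, '<', ',test2' or die $!;
   chomp(my @sentences = <$fh>);

   my $t0 = [gettimeofday];
   for my $sentence (@sentences) {
       $sth->execute($sentence);
       1 while $sth->fetchrow_arrayref;   # drain any rows
   }
   my $elapsed = tv_interval($t0);
   printf "%d selects in %.2fs (%.1f selects/s)\n",
          scalar @sentences, $elapsed, @sentences / $elapsed;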

So we have 5,000 selects against a simple database table that we *think* is 
indexed correctly, and each select takes around 10ms in raw form and 13ms 
through Perl (we accept that the additional ~3ms per call is Perl overhead). 
This gives us a select speed of approximately 80-100 selects per second. We 
need to look more into the split of real/user/sys time.

The SQLite page https://www.sqlite.org/speed.html shows in Test 7 that SQLite 
2.7.6 can do 5,000 selects per second with an index, which does seem to 
indicate that our 80-100 selects/second is a bit slow. Now, Test 7 on that 
page uses integers and we are using strings, but we are only testing for 
equality, so we would hope to be quite quick.

Our test system is a high-end Mac Pro running a 4790K CPU at 4.3GHz with an 
enterprise-level SSD (a Samsung 850, 512GB), so that's rather a lot more grunt 
than the 1.6GHz Athlon quoted in the test spec :)

We cannot easily make the system multi-threaded. We have thought of a way to 
move some of the SQL logic into Perl to speed things up, but it's a kludge and 
a hack, and that's the nicest thing I can say about it.

Basically, we can look at the command 'cat ,test1 | sqlite3 db.sqlite > 
/dev/null' and ask: is running 5,000 select statements in 50 seconds, on the 
above table with the above indexes, performant or not? My feeling is not, but 
I don't have the experience to be certain. If not, what's slowing it down?

Thanks for reading this far. I didn't want to put out a simple "my DB is slow" 
type of question without showing what we have done to try to understand the 
issue and giving you some background. I apologise if I have provided too much.

Thanks,

Rob
