Re: Effect of indices on SQLite (optimizer) performance

2011-02-08 Thread John Szakmeister
On Sat, Feb 5, 2011 at 2:41 PM, Erik Huelsmann  wrote:
[snip]
> Their FAQ (http://www.sqlite.org/faq.html#q19) sure suggests that it's
> not wise to do separate inserts: the document says SQLite easily does
> 50k inserts per sec into a table on moderate hardware, but only
> roughly 60 transactions per second...
>
> That would surely point in the direction of using transactions when
> we need mass inserts! I'm not sure exactly where in our code these
> inserts should be collected though. Maybe one of the WC-NG regulars
> has an idea?

That has certainly been my experience with SQLite.  I was feeding
several hundred thousand rows into SQLite (it was "pretend" instrument
data that I wanted to use for performance testing), and inserting them
inside a transaction made the difference between hours and seconds.
In fact, I had started the process before heading to the office, and
when I came back home 10 hours later, it was still running!
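
To make that concrete, here's a minimal sketch of the batched-insert
pattern (the table name, columns and row count are made up for
illustration; it assumes the standard Python sqlite3 module):

#!/usr/bin/python
# Sketch: inserting many rows inside a single transaction.
# Hypothetical table and data; only the transaction pattern matters.
import sqlite3, time

c = sqlite3.connect(':memory:')
c.execute("create table data (id integer primary key, value text)")

rows = [(i, 'row %d' % i) for i in range(100000)]

start = time.time()
# With the module's default isolation level the first insert opens an
# implicit transaction that stays open until commit(), so all rows
# share a single commit (one fsync) instead of paying one per row.
c.executemany("insert into data (id, value) values (?, ?)", rows)
c.commit()
print("elapsed: %f seconds" % (time.time() - start))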

-John


Re: Effect of indices on SQLite (optimizer) performance

2011-02-05 Thread Erik Huelsmann
On Sat, Feb 5, 2011 at 8:25 PM, Mark Phippard  wrote:
> On Sat, Feb 5, 2011 at 1:05 PM, Erik Huelsmann  wrote:
>
>> Scenario (2) takes ~0.27 seconds to evaluate in the unmodified
>> database. Adding an index on (wc_id, local_relpath) makes the
>> execution time drop to ~0.000156 seconds!
>>
>>
>> Seems Philip was right :-) We need to carefully review the indices we
>> have in our database to support good performance.
> I wish this document were fully fleshed out, it seems like it has some
> good info in it:
>
> http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html
>
> Getting indexes in place for the bulk of our reads is essential.  It
> seems like now would be a good time to make that a priority.  Of
> course adding more indexes will further slow down write speed (which
> seems bad already) so maybe the above document will give ideas for
> other optimizations.
>
> Did anyone see the tests I posted on users@ of a checkout with 5000
> files in a single folder?  I really thought we would be faster than 1.6
> already but we are actually several factors slower.
>
> My background is all with DB2 on OS/400.  Something I was looking for
> in SQLite docs is whether it uses hints for the number of rows in a
> table.  For example, DB2 optimizes a new table for 10,000 rows with
> increments of 1,000 when you reach the limit.  If you know you are
> inserting 100,000 rows you can get a massive performance improvement
> by telling DB2 to optimize for a larger size.  I was wondering if
> SQLite was doing something like optimizing for 100 rows or something
> small.  I noticed the end of the checkout is really slow, which
> suggests it does not insert the rows quickly.  Maybe this is just an
> area where we need to use transactions better?

Their FAQ (http://www.sqlite.org/faq.html#q19) sure suggests that it's
not wise to do separate inserts: the document says SQLite easily does
50k inserts per sec into a table on moderate hardware, but only
roughly 60 transactions per second...

That would surely point in the direction of using transactions when
we need mass inserts! I'm not sure exactly where in our code these
inserts should be collected though. Maybe one of the WC-NG regulars
has an idea?


Bye,

Erik.


Re: Effect of indices on SQLite (optimizer) performance

2011-02-05 Thread Mark Phippard
On Sat, Feb 5, 2011 at 1:05 PM, Erik Huelsmann  wrote:

> Scenario (2) takes ~0.27 seconds to evaluate in the unmodified
> database. Adding an index on (wc_id, local_relpath) makes the
> execution time drop to ~0.000156 seconds!
>
>
> Seems Philip was right :-) We need to carefully review the indices we
> have in our database to support good performance.
I wish this document were fully fleshed out, it seems like it has some
good info in it:

http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html

Getting indexes in place for the bulk of our reads is essential.  It
seems like now would be a good time to make that a priority.  Of
course adding more indexes will further slow down write speed (which
seems bad already) so maybe the above document will give ideas for
other optimizations.

Did anyone see the tests I posted on users@ of a checkout with 5000
files in a single folder?  I really thought we would be faster than 1.6
already but we are actually several factors slower.

My background is all with DB2 on OS/400.  Something I was looking for
in SQLite docs is whether it uses hints for the number of rows in a
table.  For example, DB2 optimizes a new table for 10,000 rows with
increments of 1,000 when you reach the limit.  If you know you are
inserting 100,000 rows you can get a massive performance improvement
by telling DB2 to optimize for a larger size.  I was wondering if
SQLite was doing something like optimizing for 100 rows or something
small.  I noticed the end of the checkout is really slow, which
suggests it does not insert the rows quickly.  Maybe this is just an
area where we need to use transactions better?

Anyway, a big +1 on getting the right indexes in place.  I know SQLite
has an EXPLAIN statement.  Not sure if there are tools you can use to
just capture information and have it tell you the indexes you need.
On databases like DB2 there are tools like that available and it can
save time.  In fact you could almost remove all indexes and run some
tests to let the db tell you what indexes you need.  Of course our
test suite probably does not have enough data in the db to make
indexes any faster than a table scan, so you would probably have to do
manual testing using a large working copy to see what you need.
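
For what it's worth, SQLite's EXPLAIN QUERY PLAN can be pointed at any
of our queries to see whether it picks an index or a table scan; a
minimal sketch (assuming the wcx.db/nodes schema from Erik's scripts):

#!/usr/bin/python
# Sketch: asking SQLite how it plans to execute one of our queries.
import sqlite3

c = sqlite3.connect('wcx.db')
plan = c.execute("""explain query plan
    select * from nodes
    where wc_id = 1 AND local_relpath = 'foo'""")
for row in plan:
    # the last column of each row describes the step, e.g. a full
    # table scan versus a search using an index
    print(row)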

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/


Re: Effect of indices on SQLite (optimizer) performance

2011-02-05 Thread Erik Huelsmann
Now attached as text files (to be renamed to .py) to prevent the
mailer software from dropping them...

Bye,

Erik.

On Sat, Feb 5, 2011 at 7:05 PM, Erik Huelsmann  wrote:
> Yesterday on IRC, Bert, Philip and I were chatting about our SQLite
> perf issues and how Philip's findings in the past suggested that
> SQLite wasn't using its indices to optimize our queries.
>
> After searching and discussing its documentation, Philip offered the
> (in hindsight obvious) suggestion: "maybe we have the wrong indices".
>
> So, I went to work with his "fake database generator script" (attached
> as "test.py").
>
>
> The type of query we're seeing problematic performance with looks like
> the one below. The essential part is the WHERE clause.
>
> SELECT * FROM nodes WHERE wc_id = 1 AND (local_relpath = 'foo' OR
> local_relpath like 'foo%');
>
>
> We discussed 3 ways to achieve the effect of this query:
>
>  1. The query itself
>  2. The query stated as a UNION of two queries
>  3. Running the two parts of the UNION manually ourselves.
>
> Ad (1)
> This query doesn't deliver the performance we had hoped to gain
> from using a database.
>
> Ad (2)
> In the past, UNIONs have been explicitly removed because they were
> creating temporary tables (on disk!). However, since then we have
> changed our SQLite setup to create temporary tables in memory, so the
> option should really be re-evaluated.
>
> Ad (3)
> I'd hate to have to use two queries in all places in our source where
> we want to run queries like these. As a result, I think this scenario
> should be avoided if we can.
>
>
> So, I've created 'perf.py' to evaluate each of these scenarios,
> researching the effect on each of them under the influence of adding
> different indices.
>
> These are my findings:
>
> Scenario (1) [an AND combined with a complex OR] doesn't perform well
> under any circumstance.
>
> Scenario (2) performs differently, depending on the available indices.
>
> Scenario (3) performs roughly equal to scenario (2).
>
>
> Scenario (2) takes ~0.27 seconds to evaluate in the unmodified
> database. Adding an index on (wc_id, local_relpath) makes the
> execution time drop to ~0.000156 seconds!
>
>
> Seems Philip was right :-) We need to carefully review the indices we
> have in our database to support good performance.
>
>
> Bye,
>
>
> Erik.
>
#!/usr/bin/python

import os, sqlite3, time

c = sqlite3.connect('wcx.db')
c.execute("""pragma case_sensitive_like=1""")
c.execute("""pragma foreign_keys=on""")
c.execute("""pragma synchronous=off""")
c.execute("""pragma temp_store=memory""")

start = time.clock() # cpu clock as float in secs

#c.execute("""drop index i_wc_id_rp;""")
#c.execute("""create index i_wc_id_rp on nodes (wc_id, local_relpath);""")

# ".indices" is an sqlite3 command-line shell command, not SQL;
# list the indices via sqlite_master instead.
print c.execute("select name from sqlite_master where type = 'index'").fetchall()

# strategy 1
# fetchall() forces the whole result set to be read, so the timing
# below covers the full query, not just the first row
c.execute("""select * from nodes where wc_id = 1 AND
   (local_relpath like 'foo/%'
OR local_relpath = 'foo');""").fetchall()


# strategy 2
#c.execute("""select * from nodes where wc_id = 1 AND local_relpath like 'foo/%'
#   union select * from nodes where wc_id = 1 AND local_relpath = 'foo';""").fetchall()

# strategy 3
#c.execute("""select * from nodes where wc_id = 1 AND local_relpath like 'foo/%';""").fetchall()
#c.execute("""select * from nodes where wc_id = 1 AND local_relpath = 'foo';""").fetchall()





end = time.clock()


print "timing: %5f\n" % (end - start)


#!/usr/bin/python

import os, sqlite3

try: os.remove('wcx.db')
except: pass

c = sqlite3.connect('wcx.db')
c.execute("""pragma case_sensitive_like=1""")
c.execute("""pragma foreign_keys=on""")
c.execute("""pragma synchronous=off""")
c.execute("""pragma temp_store=memory""")
c.execute("""create table repository (
   id integer primary key autoincrement,
   root text unique not null,
   uuid text not null)""")
c.execute("""create index i_uuid on repository (uuid)""")
c.execute("""create index i_root on repository (root)""")
c.execute("""create table wcroot (
   id integer primary key autoincrement,
   local_abspath text unique)""")
c.execute("""create unique index i_local_abspath on wcroot (local_abspath)""")
c.execute("""create table nodes (
   wc_id integer not null references wcroot (id),
   local_relpath text not null,
   op_depth integer not null,
   parent_relpath text,
   repos_id integer references repository (id),
   repos_path text,
   revision integer,
   presence text not null,
   depth text,
   moved_here integer,
   moved_to text,
   kind text not null,
   changed_revision integer,
   changed_date integer,
   changed_author text,
   checksum text,
   properties blob,
   translated_size integer,
   last_mod_time integer,
   dav_cache blob,
   symlink_target text,
   f

Effect of indices on SQLite (optimizer) performance

2011-02-05 Thread Erik Huelsmann
Yesterday on IRC, Bert, Philip and I were chatting about our SQLite
perf issues and how Philip's findings in the past suggested that
SQLite wasn't using its indices to optimize our queries.

After searching and discussing its documentation, Philip offered the
(in hindsight obvious) suggestion: "maybe we have the wrong indices".

So, I went to work with his "fake database generator script" (attached
as "test.py").


The type of query we're seeing problematic performance with looks like
the one below. The essential part is the WHERE clause.

SELECT * FROM nodes WHERE wc_id = 1 AND (local_relpath = 'foo' OR
local_relpath like 'foo%');


We discussed 3 ways to achieve the effect of this query:

 1. The query itself
 2. The query stated as a UNION of two queries
 3. Running the two parts of the UNION manually ourselves.
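
For concreteness, the UNION form of (2) would look roughly like this
(it mirrors the commented-out query in the attached perf.py, which
matches children with 'foo/%'):

SELECT * FROM nodes WHERE wc_id = 1 AND local_relpath like 'foo/%'
UNION
SELECT * FROM nodes WHERE wc_id = 1 AND local_relpath = 'foo';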

Ad (1)
This query doesn't deliver the performance we had hoped to gain
from using a database.

Ad (2)
In the past, UNIONs have been explicitly removed because they were
creating temporary tables (on disk!). However, since then we have
changed our SQLite setup to create temporary tables in memory, so the
option should really be re-evaluated.
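
The setting in question is the one the attached scripts already set:

PRAGMA temp_store = MEMORY;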

Ad (3)
I'd hate to have to use two queries in all places in our source where
we want to run queries like these. As a result, I think this scenario
should be avoided if we can.


So, I've created 'perf.py' to evaluate each of these scenarios,
researching the effect on each of them under the influence of adding
different indices.

These are my findings:

Scenario (1) [an AND combined with a complex OR] doesn't perform well
under any circumstance.

Scenario (2) performs differently, depending on the available indices.

Scenario (3) performs roughly equal to scenario (2).


Scenario (2) takes ~0.27 seconds to evaluate in the unmodified
database. Adding an index on (wc_id, local_relpath) makes the
execution time drop to ~0.000156 seconds!
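
For reference, that index, exactly as created in perf.py:

create index i_wc_id_rp on nodes (wc_id, local_relpath);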


Seems Philip was right :-) We need to carefully review the indices we
have in our database to support good performance.


Bye,


Erik.