[sqlite] Improving select distinct performance

Jason Freshwater Thu, 08 Oct 2009 04:00:24 -0700
Hi (Apologies for the long post),
 
I am using sqlite for storing  application data and settings.
A typical simple scenario is to import a CSV file into sqlite and
then present the user with various summaries of the data
that it contains.
 
Generally everything works really well; I can insert data at about
60,000 rows per second for a simple 6 values per row CSV having
tuned the various pragma settings. General query performance is
also good.
 
The problem I keep running into is with expressions of the form 
 
      select distinct c from t1
 
(see end of script for self contained example that generates a suitable test 
database – I also realise this topic has 
created quite a bit of discussion in the past).
 
For a 1 million row table with 8 distinct values of c the query
time on my system (pretty typical modern laptop using precompiled
sqlite 3.6.18) is about 2.8 seconds. For my requirements this is
a bit too slow...
 
A query such as
 
      select min(c) from t1
 
takes about 0.4 seconds. So it seems that visiting each row in
the table can be done pretty quickly.
 
I’d really like to come up with a method for improving the
performance of select distinct (that doesn’t involve indexing).
I’ve seen quite a bit of discussion about this in various places
and just wanted to make sure that I hadn’t missed something.
Things I have tried
 
1. Create an index
This improves performance for the query but at the expense of
taking just about as long to create the index for each column
as to load the data set.
 
2. Read every record into my application and use a hashtable to
do my own select distinct. This is pretty good, about 0.6s to
achieve the distinct list (I can count each distinct value
along the way which is a useful side effect).
 
3. Create a new aggregation function which behaves like
group_concat, but only records entries that are new. This
is also pretty good, about 0.4s for my naive implementation.
And I can put counts in as well with just an extra bit of
effort to give me effectively 
         select c, count(c) from t1 group by 1
without the overhead of indexing. You get name count output like
         D:456,A:12,B:234,G:67897,E:3434
 
So I have a number of questions:
 
1. Have I missed obvious alternatives?
 
2. Are there any plans to try and improve the situation? I
realise Oracle and Sqlite are wildly different applications,
but Oracle runs the same select distinct queries in a fraction
of the time. You can run the sample below in oracle via
sqlplus (just ignore the errors about the sqlite commands that
it doesn’t understand). Sqlite is faster for just about
everything else.
 
3. Would there be any possibility to get new aggregation
functions (like the ones I describe above) into sqlite. I’m
really keen to be able to use something like the Xerial jdbc interface without 
having to maintain my own extension libraries
on 3 different platforms.
 
I realise that method 2 (hash via jdbc) gives pretty good
results but its sort of inconvenient as a general solution.
 
I guess that some sort of on the fly hash based index for
unindexed columns would be a general solution for this and
many other problems (I saw that discussed briefly elsewhere).
 
Here is a sample sql script that will create approx 1 million
row table with a single column populated with the values 
A,B,C,D,E,F,G,H.
 
Any thoughts, suggestions, pointers to the FAQ I missed etc are welcome.
 
Cheers,
 
Jason
 
run with
 
sqlite3  < test_distinct.sql
 
or
 
sqlplus user/pass @ test_distinct.sql
 
----------------test_distinct.sql-------------------------
 
 
-- sqlite
pragma synchronous=OFF;
pragma count_changes=OFF;
pragma journal_mode=OFF;
pragma read_uncomitted=OFF;
pragma temp_store=MEMORY;
pragma cache_size=4096;
pragma page_size=1024;
 
drop table t1;
--drop table if exists t1;
 
create table t1 (c varchar2(1000));
insert into t1(c) values('A');
insert into t1(c) values('B');
insert into t1(c) values('C');
insert into t1(c) values('D');
insert into t1(c) values('E');
insert into t1(c) values('F');
insert into t1(c) values('G');
insert into t1(c) values('H');
-- duplicate up the values to about 1 million rows
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
 
-- sqlite
.timer ON
.echo ON
-- oracle
set timing on;
set echo on;
 
select count(*) from t1;
select distinct c from t1;
--explain query plan select distinct c from t1;
select c from t1 group by c;
select c, count(c) from t1 group by c;
select min(c) from t1;
 
-- now test the indexing route
 
create index t1_c_idx on t1(c);
analyze t1;
select distinct c from t1;
select c from t1 group by c;
select min(c) from t1;
select c, count(c) from t1 group by c;
                                          
_________________________________________________________________
Learn how to add other email accounts to Hotmail in 3 easy steps.
http://clk.atdmt.com/UKM/go/167688463/direct/01/
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Improving select distinct performance

Reply via email to