Re: [sqlite] Improving select distinct performance

Jason Freshwater Thu, 08 Oct 2009 10:04:54 -0700

If this comes out garbled again the answer is java 1.6 on 1.8Ghz windows xp


i get about 500,000 rows/second reading single character values via xerial jdbc

> From: john.elr...@fenestra.com
> Jason Freshwater wrote:
> > 2. Read every record into my application and use a hashtable to
> > do my own select distinct. This is pretty good, about 0.6s to
> > achieve the distinct list (I can count each distinct value
> > along the way which is a useful side effect).
>
> Would you mind my asking what language and what OS you are using? My
> best performance tests have given ~10,000 rows per second on Windows
> based machines, your results indicate two orders of magnitude better.
>
> John

Sure, its worth just clarifying one thing
the sample script I attached generates a ~1M row table, the sample
data I refer to timings of is a 384,000 record csv file with 6 short fields
e.g.
ABCDEF,Z,A,X1,200.0,10.0
per record. So there is maybe a factor of 3 difference in amount of data


I'm running java 1.6 on windows xp using the xerial jdbc driver
which unpacks the native libraries for sqlite when it runs.
1.8GHz core 2 duo, with 4Gb ram (but none of this is memory bound)


if you are interested, output from my test program is as follows
I dont think i have made any silly errors of arithmetic - but you never know!


I use
stat.executeUpdate("pragma synchronous=OFF;");
stat.executeUpdate("pragma count_changes=OFF;");
stat.executeUpdate("pragma journal_mode=OFF;");
stat.executeUpdate("pragma read_uncomitted=OFF;");
stat.executeUpdate("pragma temp_store=MEMORY;");
stat.executeUpdate("pragma cache_size=1280;");
stat.executeUpdate("pragma page_size=4096;");
before doing anything


numbers suggest about 500,000 rows/second reading (last result set below)

---------
# read csv and manually insert via jdbc
insert into sd values (?, ?, ?, ?, ?, ?)
100000
200000
300000
time for 384001 inserts: 6734 ms 57024.2 rows/second


# run select distinct in db and read results
select distinct row_ from sd
ROW_
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
query time for 16 rows = 2531ms


# this just reads the data from all columns# in the select and sums the length
select rowid, row_ from sd order by row_
total data length 2576895
query time for 384000 rows = 3969ms
# read the rows without sorting
select row_ from sd
total data length 384000
query time for 384000 rows = 563ms
# distinct with count
select row_,count(row_) from sd group by 1
ROW_ count(row_)
A 24000
B 24000
C 24000
D 24000
E 24000
F 24000
G 24000
H 24000
I 24000
J 24000
K 24000
L 24000
M 24000
N 24000
O 24000
P 24000
query time for 16 rows = 2578ms

# this is the hashmap distinct with the counts printed below
D = 24000
O = 24000
K = 24000
H = 24000
C = 24000
P = 24000
B = 24000
L = 24000
G = 24000
I = 24000
M = 24000
A = 24000
F = 24000
J = 24000
N = 24000
E = 24000
query time for hash distinct: select row_ from sd: 384000 rows 593ms
total time 14500ms
                                          
_________________________________________________________________
Use Hotmail to send and receive mail from your different email accounts.
http://clk.atdmt.com/UKM/go/167688463/direct/01/
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Improving select distinct performance

Reply via email to