> From: john.elr...@fenestra.com
>> 2. Read every record into my application and use a hashtable to
>> do my own select distinct. This is pretty good, about 0.6s to
>> achieve the distinct list (I can count each distinct value
>> along the way, which is a useful side effect).
>
> Would you mind my asking what language and what OS you are using? My
> best performance tests have given ~10,000 rows per second on Windows
> based machines; your results indicate two orders of magnitude better.
>
> John

Sure. It's worth clarifying one thing first: the sample script I attached
generates a ~1M row table, but the sample data my timings refer to is a
384,000-record CSV file with 6 short fields, e.g.

    ABCDEF,Z,A,X1,200.0,10.0

per record. So there is maybe a factor of 3 difference in the amount of data.

I'm running Java 1.6 on Windows XP, using the Xerial JDBC driver, which
unpacks the native SQLite libraries when it runs. The machine is a 1.8GHz
Core 2 Duo with 4GB RAM (but none of this is memory bound).

If you are interested, the output from my test program is below. I don't
think I have made any silly errors of arithmetic - but you never know!

It's a Java program with various calls to getResultSet() and loops of the
form

    while (rs.next()) { }

Batching the inserts has a measurable effect.

Before doing anything else I issue:

    stat.executeUpdate("pragma synchronous=OFF;");
    stat.executeUpdate("pragma count_changes=OFF;");
    stat.executeUpdate("pragma journal_mode=OFF;");
    stat.executeUpdate("pragma read_uncommitted=OFF;");
    stat.executeUpdate("pragma temp_store=MEMORY;");
    stat.executeUpdate("pragma cache_size=1280;");
    stat.executeUpdate("pragma page_size=4096;");

The numbers suggest about 500,000 rows/second reading (last result set
below), which is consistent with my experience with e.g.
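For reference, the hashtable distinct-with-count described at the top can be
sketched as below. This is a minimal standalone sketch, not the original
program (which isn't shown); it uses current Java collections rather than
the Java 1.6 idioms the original would have needed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HashDistinct {
    // Build the distinct list and count each value in a single pass,
    // instead of running SELECT DISTINCT in the database.
    public static Map<String, Integer> distinctWithCounts(Iterable<String> rows) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String row : rows) {
            // insert 1 on first sight, otherwise increment the existing count
            counts.merge(row, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c =
                distinctWithCounts(List.of("A", "B", "A", "C", "B", "A"));
        System.out.println(c); // {A=3, B=2, C=1}
    }
}
```

In the real program each `row` would come from `rs.next()` / `rs.getString(...)`
while scanning the result set of the plain `select row_ from sd`.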
Oracle.

---------

# read csv and manually insert via jdbc
insert into sd values (?, ?, ?, ?, ?, ?)
100000
200000
300000
time for 384001 inserts: 6734 ms
57024.2 rows/second

# run select distinct in db and read results
select distinct row_ from sd
ROW_
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
query time for 16 rows = 2531 ms

# this just reads the data from all columns
# in the select and sums the length
select rowid, row_ from sd order by row_
total data length 2576895
query time for 384000 rows = 3969 ms

# read the rows without sorting
select row_ from sd
total data length 384000
query time for 384000 rows = 563 ms

# distinct with count
select row_, count(row_) from sd group by 1
ROW_  count(row_)
A     24000
B     24000
C     24000
D     24000
E     24000
F     24000
G     24000
H     24000
I     24000
J     24000
K     24000
L     24000
M     24000
N     24000
O     24000
P     24000
query time for 16 rows = 2578 ms

# this is the hashmap distinct, with the counts printed below
D = 24000
O = 24000
K = 24000
H = 24000
C = 24000
P = 24000
B = 24000
L = 24000
G = 24000
I = 24000
M = 24000
A = 24000
F = 24000
J = 24000
N = 24000
E = 24000
query time for hash distinct: select row_ from sd: 384000 rows 593 ms

total time 14500 ms
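As a quick cross-check of the arithmetic in the output above (the row counts
and timings are taken from that output; the derived rates are simply
count / time, so the round-number summaries are mine):

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        // 384001 inserts in 6734 ms (first result set above)
        double insertRate = 384001 / 6734.0 * 1000.0;
        // 384000 rows read in 593 ms (the hash-distinct pass, the last result set)
        double readRate = 384000 / 593.0 * 1000.0;
        System.out.printf("inserts: %.1f rows/second%n", insertRate); // ~57024.2
        System.out.printf("reads:   %.0f rows/second%n", readRate);   // ~647555
    }
}
```

So the "57024.2 rows/second" insert figure checks out, and the read rate is
in fact somewhat above the "about 500,000 rows/second" quoted.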