> From: john.elr...@fenestra.com
>> 2. Read every record into my application and use a hashtable to
>> do my own select distinct. This is pretty good, about 0.6s to
>> achieve the distinct list (I can count each distinct value
>> along the way, which is a useful side effect).
>
> Would you mind my asking what language and what OS you are using? My
> best performance tests have given ~10,000 rows per second on Windows
> based machines; your results indicate two orders of magnitude better.
>
> John

Sure. It's worth clarifying one thing first: the sample script I attached
generates a ~1M row table, but the sample data my timings refer to is a
384,000-record CSV file with 6 short fields, e.g.

    ABCDEF,Z,A,X1,200.0,10.0

per record. So there is maybe a factor of 3 difference in the amount of data.

I'm running Java 1.6 on Windows XP, using the Xerial JDBC driver, which
unpacks the native SQLite libraries when it runs. The machine is a 1.8GHz
Core 2 Duo with 4GB RAM (but none of this is memory bound).

If you are interested, the output from my test program is below. I don't
think I have made any silly errors of arithmetic - but you never know!

It's a Java program with various calls to getResultSet() and loops of the
form

    while (rs.next()) { }

Batching the inserts has a measurable effect.

Before doing anything else I issue:

    stat.executeUpdate("pragma synchronous=OFF;");
    stat.executeUpdate("pragma count_changes=OFF;");
    stat.executeUpdate("pragma journal_mode=OFF;");
    stat.executeUpdate("pragma read_uncommitted=OFF;");
    stat.executeUpdate("pragma temp_store=MEMORY;");
    stat.executeUpdate("pragma cache_size=1280;");
    stat.executeUpdate("pragma page_size=4096;");

The numbers suggest about 500,000 rows/second reading (last result set
below), which is consistent with my experience with e.g.
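For reference, the hashtable distinct-with-count described at the top can be
sketched as below. This is a minimal standalone sketch, not the original
program (which isn't shown); it uses current Java collections rather than
the Java 1.6 idioms the original would have needed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HashDistinct {
    // Build the distinct list and count each value in a single pass,
    // instead of running SELECT DISTINCT in the database.
    public static Map<String, Integer> distinctWithCounts(Iterable<String> rows) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String row : rows) {
            // insert 1 on first sight, otherwise increment the existing count
            counts.merge(row, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c =
                distinctWithCounts(List.of("A", "B", "A", "C", "B", "A"));
        System.out.println(c); // {A=3, B=2, C=1}
    }
}
```

In the real program each `row` would come from `rs.next()` / `rs.getString(...)`
while scanning the result set of the plain `select row_ from sd`.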
Oracle.

---------

# read csv and manually insert via jdbc
insert into sd values (?, ?, ?, ?, ?, ?)
100000
200000
300000
time for 384001 inserts: 6734 ms
57024.2 rows/second

# run select distinct in db and read results
select distinct row_ from sd
ROW_
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
query time for 16 rows = 2531 ms

# this just reads the data from all columns
# in the select and sums the length
select rowid, row_ from sd order by row_
total data length 2576895
query time for 384000 rows = 3969 ms

# read the rows without sorting
select row_ from sd
total data length 384000
query time for 384000 rows = 563 ms

# distinct with count
select row_, count(row_) from sd group by 1
ROW_  count(row_)
A     24000
B     24000
C     24000
D     24000
E     24000
F     24000
G     24000
H     24000
I     24000
J     24000
K     24000
L     24000
M     24000
N     24000
O     24000
P     24000
query time for 16 rows = 2578 ms

# this is the hashmap distinct, with the counts printed below
D = 24000
O = 24000
K = 24000
H = 24000
C = 24000
P = 24000
B = 24000
L = 24000
G = 24000
I = 24000
M = 24000
A = 24000
F = 24000
J = 24000
N = 24000
E = 24000
query time for hash distinct: select row_ from sd: 384000 rows 593 ms

total time 14500 ms
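As a quick cross-check of the arithmetic in the output above (the row counts
and timings are taken from that output; the derived rates are simply
count / time, so the round-number summaries are mine):

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        // 384001 inserts in 6734 ms (first result set above)
        double insertRate = 384001 / 6734.0 * 1000.0;
        // 384000 rows read in 593 ms (the hash-distinct pass, the last result set)
        double readRate = 384000 / 593.0 * 1000.0;
        System.out.printf("inserts: %.1f rows/second%n", insertRate); // ~57024.2
        System.out.printf("reads:   %.0f rows/second%n", readRate);   // ~647555
    }
}
```

So the "57024.2 rows/second" insert figure checks out, and the read rate is
in fact somewhat above the "about 500,000 rows/second" quoted.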