Am 31.10.2013 14:09, schrieb Dominique Devienne:

[Userdefined functions in conjunction with fast Exists-checks
in "Userland" - vs. SQLites built-in indexing in case of In (List)]

I'm not convinced by this. The "real table" can be quite large, several
100's to 100,000's rows (up to 1+ million rows) and col can be the primary
key, or a non-unique "parent" key where many parent keys have about 10 rows
each, and a few have in the 1000's, while the in-list could very small
(down to just 1 element) or quite large (several thousands).

With a function based approach, you are *always* full-scanning the whole
"real" table, no matter the cardinality of the InList operand, and even
with a very fast InList function, this is not going to beat getting 10 PK
rows, or 10 "parent" key rows (e.g. 100 rows to 10,000 rows) via indexes,
especially since these are virtual tables with Boost.MultiIndex unique or
non-unique indexes (i.e. 5x to 10x faster than SQLite's paged B-tree
indexes). It might well beat it if the InList operand cardinality is high,
as in your 40K and 60K testing in a 100K rows table, because an InList
that's 40% or 60% of the whole table is close enough to a full scan that
using a native code set or map test similarly outperforms SQLite's generic
paged B-tree indexes like our Boost.MultiIndex-based indexes.

Of course that's speculation on my part, versus your timed experimentation,
so could well be that I'm wrong. And I'll need to look into this eventually.


You're not wrong - although the UDF-timings in my previous post are
correct - it is true that they will remain (relatively) constant -
even in case we reduce the Count in the CompareList from 40000 to
1000 or 100.

All timings with 100000 records in the "real" table - FullTable-scan
due to using an UDF with a sorting-Dictionary-instance:
36msec (100 items in the compare-list)
43msec (1000 items in the compare-list)
48msec (10000 items in the compare-list)
52msec (40000 items in the compare-list)

The above was no surprise to me, because I'd expected that due to the
FullTable-scans in case of the UDF-approach... what came as a surprise
was the kind of "inverse-lookup" the SQLite-optimizer apparently
performs, when an index exists on the "real" table which provides
the Column-value to compare against the "In"-list.

In my large compare-lists (40000 and 60000) this behaviour didn't
become obvious in the timings whilst with 100 and 1000 items in the
compare-lists there was clearly a difference.

Again, all timings with 100000 records in the "real" table -
the compare-list created beforehand in a tmp-table -
the table- and index-creation not included in the timings
SQL: Select Count(*) from T Where Col in Tmp

No indexes in the whole setup (not on the "real" table T and also
not on the Tmp-Table):
37msec (100 items in the compare-list)
47msec (1000 items in the compare-list)
84msec (10000 items in the compare-list)
136msec (40000 items in the compare-list)

With only an index on the Tmp-Table-Column:
37msec (100 items in the compare-list)
56msec (1000 items in the compare-list)..triple-checked, not an outlier
65msec (10000 items in the compare-list)
77msec (40000 items in the compare-list)

With only an index on the real table (on the compare-value-column):
0.4msec (100 items in the compare-list)
1.9msec (1000 items in the compare-list)
26msec (10000 items in the compare-list)
116msec (40000 items in the compare-list)

With both indexes (on the real table and the tmp-table):
Identical timings to the case above - apparently the index on
the real table was choosen in favour of the tmp-table-index -
which is the correct choice of the optimizer for all compare-list-counts below 30000 or so (since with 40000 the index
on the tmp-table performs clearly faster already).

So, my mistake was to choose too large compare-list-counts in
my first test-setup - otherwise it would have become obvious
that indexes on the original "real" table are indeed worthwhile.

This holds true for compare-listcounts smaller than about
a third of the total records in the original table.
An index on the Tmp-Table which holds the compare-list is
apparently only worthwhile above this compare-count.

The timings against a 100000-records-table in a fulltable-
scan with the UDF (here again - this was on a intel i5 2.8GHz):
36msec (100 items in the compare-list)
43msec (1000 items in the compare-list)
48msec (10000 items in the compare-list)
52msec (40000 items in the compare-list)

are not that bad - and I'd guess (since the COM-SQLite-wrapper
I've used has some more overhead due to the Interface-delegation)
that there's perhaps 5msec to subtract compared with C/C++ UDFs -
and I can also imagine, that a nice Boost-Object can also out-
perform the SortingDictionary I've used (perhaps by 20-40% or so).

So, in a C/C++ setup I'd expect these values for a UDF
with a Boost-object for the exists-checks (rough estimate):
21msec (100 items in the compare-list)
26msec (1000 items in the compare-list)
30msec (10000 items in the compare-list)
34msec (40000 items in the compare-list)

Maybe all these values provide an insight also for others - in what
"regions" the SQLite-In-List functionality (roughly) operates
timing-wise (under somewhat idealized conditions).


Olaf

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to