On Tue, 19 May 2015 20:44:17 +0000
Eric Hill <Eric.Hill at jmp.com> wrote:

> But then what about a query like this:
> 
>       SELECT * FROM T1
>               LEFT OUTER JOIN T2 ON ( T2.a = T1.a ) AND ( T2.b =
> T1.b ) AND ( T2.c = T1.c );
> 
> xBestIndex will get called here for T1 with 3 constraints, c, b, and
> a, in that order.  In this case, though, it seems (to the
> uninitiated, at least ;-) that the "best index" would be:
> 
>       CREATE INDEX T1_all ON T1 (c, b, a);

Given that query, any index that includes a, b, and c would serve
equally well.  The order of the criteria in the ON clause is
immaterial.  

> I guess xBestIndex is saying "Tell me about indexes that you already
> have available so I can take advantage of them", 

Right.  

> I have *no* indexes, but I am willing to make whatever indexes would
> be most helpful, if I could just figure that out.

Depending on the physical characteristics of thing you're
representing as a virtual table, you might not want any indexes. If
every access requires either a sequential scan or a random access to
computed location, all search criteria are either perfect or
inconsequential.  If that's fast enough, it's fast enough.  If you want
to make it faster, indexes are certainly an option.  

Don't start by guessing, though.  Rather than beginning by trying to
anticipate what the query-generation tool will produce (and thus what
xBestIndex combinations will be interrogated), it's better to put the
horse before the cart and consider the data.  Once you've characterized
your data, you apply that information to that supplied by xBestIndex.  

If you have a table in 1NF, you have a key.  What are the columns that
identify a row?  That's your primary key, and your first index. There
may be other sets of columns that uniquely identify a row; these also
could use an index to enforce/verify uniqueness.  

Suppose your primary key is {a,b,c} and you want to construct an index
for it.  At this point I start will talking out of school, because I
don't know anything about the SQLite query planner.  But if I go astray
I'm sure others will correct me.  

The choice of the first column in the index is the main concern, and it
is influenced by the dominant search criteria i.e., the kinds of queries
that will be submitted.  

Queries in general are of two kinds: point queries, returning one row
(specifying values for every column in the index), and range queries
(specifying an incomplete set of values for the index).  Point queries
are indifferent to column order: find row in index and return. 
Range queries are the interesting ones.  

In a B+ tree such as SQLite uses for its indexes, rows will be sorted
in column order.  If your index is {a,b,c}, all the "a" rows will be
together and all the "b" columns within those "a"'s, and so on.  When
the query says, 

        where a between x and y

the I/O engine can scoop all the row references out of contiguous
pages in the index.  Whereas if the query says, 

        where b between x and y

the index is all but useless.  Because it's sorted by "a", it would have
to be scanned from beginning to end to find all the "b"'s that meet the
criteria.  

The other important criterion to answering xBestIndex, as Hick
mentioned, is cardinality.  You can create for each column a histogram
of the frequency distribution for the values in that column.  If the
column domain is two values (e.g., 'male' and 'female'), for example,
the probability of it matching a queried value X might be 50%.  (I think
I read that at Google 4 employees in 5 are men, in which case there a
value of 'female' would be a much more useful 20%.)  You can use the
combination of selective cardinality and your intuition (or experience)
for probable search criteria to create other indexes.  Your answer to
xBestIndex will be based on the product of the selectivity of the
contiguous columns (starting from the first) in the applicable index.  

Cardinality is only one part of the equation, however; the other is the
operator.  Equality is at least an opportunity for a point query, but
inequality implies a range.  How big a range?  If we're talking about
ages, for example, 

        where age > x

If X is 0, that's not very useful, whereas if X is 100, it's very
selective of human beings (assuming age is in years).  Problem is, X
isn't supplied (for good reason) to xBestIndex.  How to answer?  

You can only work with what you know: the cardinality of the column,
the operator, and the rowcount.  Rows and cardinality you know, and
they apply directly to equality comparisons.  For inequality, you need
a heuristic.  A quarter of a century ago, during the Late Bronz Age,
Sybase used a simple heuristic: every inequality selected 25% of the
table.  That meant a BETWEEN query (two inequalities) was 6.25% = 0.25
* 0.25.  I doubt they pulled that number out of a hat.  Absent better
information, that might be your best option.  

I hope that's of some use.  Complicated, for sure, and no small amount
of work.  Welcome to the woolly world of query optimization.  

--jkl


Reply via email to