Re: [DISCUSS] - QueryIndex selection

Thomas Mueller Wed, 18 Jun 2014 04:45:54 -0700

Hi,

>>QueryIndex.getCost


>my doubt is what
>this heuristic function to estimate the "traversed entries" should look
>like in general

Relational databases typically know the number of entries in the index
(total indexed entries), plus the selectivity of a column. See also
http://www.akadia.com/services/ora_index_selectivity.html . That way it's
quite easy to estimate the number of returned values, for a given filter.
I think Lucene and Solr have a way to get those numbers, but I didn't look
into that yet. I guess the problem might be, the cost method needs to be
very fast, otherwise all the queries would be affected. Relational
databases typically have those index statistics fully in memory.

Some databases additionally have a histogram that describes the
distribution of the values. See also
http://www.dba-oracle.com/t_histograms.htm

>, but that may be implementation specific

The cost returned by an index should not be implementation specific,
otherwise the cost between different index implementations can't be
compared.

>, and how that
>estimate should be used, should we just return the number of estimated
>entries for the cost? So that cost is 1 when we estimate to return 1
>entry,
>2 when we estimate to return 2 entries, 1000 when we estimate to return
>1000 entries?

Generally yes, but if you keep everything in memory or know things are
cached, then you could return less (as documented).

>My other concern on this point is that it's not granted, in my opinion,
>that the index returning less entries would be the faster.

Yes, it's not that much about less entries or more entries, it's about
lower or higher cost. If there are more entries, but they are all in
memory, then that's better than less entries, but you need a disk read for
each entry. That's what the javadocs talk about.

>it would be ok for me to either deprecate it or improve the semantics of
>the cost calculation (e.g. explicitly introduce other metrics to be taken
>into account in the cost calculation: local / remote index,

The QueryIndex will be deprated, sure. Until the AdvancedQueryIndex is
used, the documented behavior should be used: "...and if there could in
theory be one network roundtrip or disk read operation per node (this
method may return a lower number if the data is known to be fully in
memory)."

>since the FullTextQueryIndex seems to be a marker interface, could we
>"merge" it some way into the AdvancedQueryIndex? Just to make things
>simpler.

Yes. The IndexPlan of the AdvancedQueryIndex has a method
"isFulltextIndex". Once we can switch to the AdvancedQueryIndex, the
FullTextQueryIndex is no longer needed.

>In the end my very generic take on this discussion is that we should try
>to
>have a clearer API/documentation for the part that involves picking the
>index via cost / index plans (but maybe that's only me), implementations
>can be then adjusted accordingly if needed.

Yes, sure.

Regards,
Thomas

Re: [DISCUSS] - QueryIndex selection

Reply via email to