Re: [DISCUSS] - QueryIndex selection

Tommaso Teofili Wed, 18 Jun 2014 01:27:46 -0700

2014-06-04 9:36 GMT+02:00 Thomas Mueller <muel...@adobe.com>:

> Hi,
>
> QueryIndex.getCost: this is actually quite well documented (see the
> Javadocs). But the implementations might not fully follow the contract :-)
>


this is probably just my opinion, but the contract is not very clear to me.
Finding "the worst-case cost to query with the given filter" defines what
should be calculated, and "the returned cost is a value between 1 (very
fast; lookup of a unique node) and the estimated number of entries to
traverse" defines the output range as a function of the estimated number of
results that the query would eventually return using that index.
Regardless of existing implementations, my doubts are (a) what this
heuristic function to estimate the "traversed entries" should look like in
general (though that may be implementation specific), and (b) how that
estimate should be used: should we just return the number of estimated
entries as the cost? So that the cost is 1 when we estimate to return 1
entry, 2 when we estimate to return 2 entries, 1000 when we estimate to
return 1000 entries?

My other concern on this point is that it's not guaranteed, in my opinion,
that the index returning fewer entries would be the faster one.
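
To make the two concerns above concrete, here is a minimal sketch. All names
are hypothetical and this is not Oak's actual code: naiveCost is the literal
"cost = estimated entries, never below 1" reading, and weightedCost assumes
an extra per-entry weight (e.g. higher for a remote index) just to show how
the ranking can flip.

```java
// Hypothetical sketch; names and weights are assumptions, not Oak's API.
final class CostReadingSketch {

    /** Literal reading of the contract: cost is the estimated entry count, never below 1. */
    static double naiveCost(long estimatedEntries) {
        return Math.max(1.0, (double) estimatedEntries);
    }

    /** Assumed refinement: each entry is weighted by a per-entry cost (e.g. remote access latency). */
    static double weightedCost(long estimatedEntries, double costPerEntry) {
        return Math.max(1.0, estimatedEntries * costPerEntry);
    }

    public static void main(String[] args) {
        // A local index expected to traverse 1000 entries vs. a remote one expected to traverse 800.
        boolean naivePrefersRemote = naiveCost(800) < naiveCost(1000);
        boolean weightedPrefersLocal = weightedCost(1000, 1.0) < weightedCost(800, 5.0);
        System.out.println("naive reading prefers remote index: " + naivePrefersRemote);
        System.out.println("weighted reading prefers local index: " + weightedPrefersLocal);
    }
}
```

With the naive reading the index with fewer entries always wins; as soon as
entries are not equally expensive to fetch, that is no longer the faster
choice.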


> But anyway, I think it's better to deprecate it and use
> AdvancedQueryIndex, as it has more features (especially important for
> ordered indexes).


it would be ok for me to either deprecate it or improve the semantics of
the cost calculation (e.g. explicitly introduce other metrics to be taken
into account in the cost calculation: local / remote index, ...).


> Currently, both QueryIndex and AdvancedQueryIndex are
> supported, in the future I hope we can switch all index implementations to
> AdvancedQueryIndex. I didn't want to do that just before the 1.0 release
> however.
>

right, good point.


>
> > get rid for example of the FullTextQueryIndex interface
>
> The FullTextQueryIndex allows choosing a full-text index for full-text
> queries, if one is available, even if the cost is higher. The problem is
> that only full-text indexes can return the correct data, if full-text
> constraints are used. If we want to get rid of the FullTextQueryIndex
> interface, we need to address this problem in some other way.
>

since the FullTextQueryIndex seems to be a marker interface, could we
"merge" it in some way into the AdvancedQueryIndex? Just to make things
simpler.
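
A rough sketch of what such a "merge" could look like, with made-up types
(Plan and supportsFullText are my assumptions, not the existing Oak
interfaces): the marker interface becomes a capability flag on the plan, and
the selection only considers full-text capable plans when the query has
full-text constraints, even if their cost is higher.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical types for illustration only; they are not the Oak interfaces.
final class FullTextSelectionSketch {

    /** Stand-in for an index plan: a cost plus a capability flag replacing the marker interface. */
    static final class Plan {
        final String indexName;
        final double cost;
        final boolean supportsFullText;

        Plan(String indexName, double cost, boolean supportsFullText) {
            this.indexName = indexName;
            this.cost = cost;
            this.supportsFullText = supportsFullText;
        }
    }

    /** Picks the cheapest eligible plan; full-text queries only accept full-text capable plans. */
    static Optional<Plan> select(List<Plan> plans, boolean queryHasFullTextConstraint) {
        return plans.stream()
                .filter(p -> !queryHasFullTextConstraint || p.supportsFullText)
                .min(Comparator.comparingDouble(p -> p.cost));
    }
}
```

An empty result for a full-text query would mean that no available index can
return correct data, which the caller has to handle explicitly.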


>
> > should we always select the fastest index ? Especially for full text
> >ones this should be in some way configurable.
>
> There are multiple aspects to this: (a) "let the user decide which index
> to use", and (b) "synchronous versus async indexes":
>
> (a) The user should be able to decide which index to use for a certain
> query. There are some problems with that: The index the user has in mind
> might not be available (in a certain environment, or with a later version
> of Oak, because for example the index implementation was replaced, or when
> not using Oak).


right, in this case we can fall back to the current mechanism where the
query engine decides which index to use.


> Hardcoding the index to use (in the query) is problematic.
> Relational databases: Oracle supports "hints" (search for "oracle database
> hint"). SQLite supports something similar, but there it's actually an
> assertion that an index is available, not a hint:
> http://www.sqlite.org/lang_indexedby.html . My position is that we should
> avoid such a mechanism, and instead improve the query engine, the index
> implementations, and the documentation.
>

+1, I would not want to specify the index to be used in the (JCR) query.
The approach I was thinking of was to let the user give an order of
preference for the indexes to be used (at the repository level), which
would be respected when those indexes exist and would otherwise fall back
to the current selection mechanism in case of missing index(es).
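
Something along these lines, as a sketch only: preferenceOrder stands for a
hypothetical repository-level setting (nothing like it exists today as far
as I know), and availableCosts is assumed to hold the cost each available
index reports for the query, with an infinite cost meaning "cannot handle
this query".

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of preference-based selection with a cost-based fallback.
final class PreferredIndexSketch {

    /**
     * Returns the first preferred index that is available and able to handle the query;
     * otherwise falls back to the index with the lowest finite cost.
     *
     * @param preferenceOrder ordered index names from an assumed repository-level setting
     * @param availableCosts  cost reported by each available index for the current query
     * @return the selected index name, or null when no index can handle the query
     */
    static String select(List<String> preferenceOrder, Map<String, Double> availableCosts) {
        for (String name : preferenceOrder) {
            Double cost = availableCosts.get(name);
            if (cost != null && !cost.isInfinite()) {
                return name; // preferred index exists and can handle the query
            }
        }
        // Fallback: plain lowest-cost selection.
        String best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> e : availableCosts.entrySet()) {
            if (e.getValue() < bestCost) {
                best = e.getKey();
                bestCost = e.getValue();
            }
        }
        return best;
    }
}
```

For example, with a preference of ("solr", "lucene") and only "property"
and "lucene" available, "lucene" would be picked regardless of its cost;
with a preference of ("solr") alone, the cheaper of the two available
indexes would be picked.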


>
> (b) Synchronous versus async indexes: some indexes are updated
> asynchronously, and therefore will not include the very latest additions
> to the content. (Recently removed nodes are not a problem, as the query
> engine will anyway have to check if a node is available). We could let the
> user decide if using an asynchronous index is OK or not.
>
> For both (a) and (b), one problem is that the JCR spec doesn't allow for
> extensions in the query syntax, so if the user would use "select ...
> option async_ok", the query would not work with Jackrabbit 2.x and other
> JCR implementations. Maybe we should create a JCR commons utility method,
> so that one could use:
>
>     QueryManager qm = ...
>
>     Query q = JcrUtils.createQuery(qm, query, language, ASYNC_OK);
>
>
> The method JcrUtils.createQuery could then use "instanceof" to decide
> whether it's OK to modify the query (for Oak) or not (non-Oak).
>

yes, that's an option but if we could avoid having it in the query that
would be better.


>
> > for full text queries for example, one may be interested in having a
> >higher recall (more documents matching the query) which may eventually
> >lead to a slightly slower query execution / higher cost evaluation
>
> That would also need to be specified by the user in some way, right? For
> example in the query itself? We could use a mechanism similar to
> ASYNC_OK above, so the application would still work for Jackrabbit 2.x.
>

I am not sure, I would like to avoid having to specify such things in the
query (as with the index to be used).
With the current implementation, one reliable way to define the index to be
used is (or should be) to use the native query language [1].

In the end my very generic take on this discussion is that we should try to
have a clearer API/documentation for the part that involves picking the
index via cost / index plans (but maybe that's only me); implementations
can then be adjusted accordingly if needed.
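
As an example of the kind of thing I would like to see spelled out in the
documentation, here is one plausible way to combine the three plan metrics
(cost per execution, cost per entry, estimated entry count) into a single
comparable number. The formula is my assumption for the sake of discussion,
not a statement of what the query engine does.

```java
// Hypothetical cost model; the formula is an assumption, not documented Oak behavior.
final class PlanCostSketch {

    /**
     * Fixed overhead of running the query once, plus a per-entry cost for
     * every entry the index is expected to return.
     */
    static double totalCost(double costPerExecution, double costPerEntry, long estimatedEntryCount) {
        return costPerExecution + costPerEntry * estimatedEntryCount;
    }
}
```

If something like this were written down, a new index implementation would
only need to report its own three numbers, without having to look at the
values other implementations return.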

Regards,
Tommaso

[1] : http://jackrabbit.apache.org/oak/docs/query.html#Native_Queries


> Regards,
> Thomas
>
>
> On 26/05/14 10:25, "Tommaso Teofili" <tommaso.teof...@gmail.com> wrote:
>
> >Hi all,
> >
> >I'd like to start discussing how we may improve / simplify the current way
> >of selecting a query index to use for a certain query.
> >
> >In the QueryIndex interface we have the plain old getCost method, which is
> >used to select the index returning the lowest cost for the given query, but
> >recently an AdvancedQueryIndex interface has also been introduced which,
> >if I understood things correctly, uses the IndexPlan(s) returned by each
> >query index for the given query to select which one has to be used.
> >So I would like to discuss if it's possible to clean things up a bit in
> >order to have a unified index selection mechanism.
> >
> >At the moment, in my opinion, one problem with the getCost() method is
> >that it inherently merges the following topics:
> >- index capability to handle a certain query (can the QueryIndex handle
> >that query?)
> >- index efficiency in handling a certain query (how fast will the
> >QueryIndex be in handling that query?)
> >
> >Also the efficiency is not evaluated on a "cost model"; each QueryIndex
> >implementation can return an arbitrary, different number. On one hand this
> >is ok as it allows taking very index-specific constraints into account; on
> >the other hand, anyone writing a new QueryIndex implementation has to look
> >into every other query index implementation to understand (and design)
> >if / when their index is picked up; and even with already
> >existing indexes it's not easy to say upfront which one will be selected
> >(e.g. for debugging purposes).
> >
> >With the AdvancedQueryIndex, if I understood it correctly (I just had a
> >look at it on Friday), a QueryIndex is selected upon its IndexPlan, which
> >is supposed to address better both the cost (as it explicitly exposes the
> >cost per execution, cost per entry and estimated entry count metrics) and
> >the query index capability to handle a certain query (e.g. this is used
> >for the ordered property index).
> >However, at the moment, only the OrderedPropertyIndex is using it so I
> >think it'd be good to decide if we want to go further with the
> >AdvancedQueryIndex also for the other QueryIndex implementations (and get
> >rid for example of the FullTextQueryIndex interface as it seems useless to
> >me) or not.
> >
> >One final question on query index selection, should we always select the
> >fastest index ?
> >Especially for full text ones this should be in some way configurable.
> >
> >What do others think?
> >Regards,
> >Tommaso
> >
> >p.s.:
> >As discussed also offline last week with some other folks, maybe one
> >further metric to be taken into consideration for the index selection is
> >whether the index is synchronous or not
>
>
