Re: [DISCUSS] - QueryIndex selection

2014-06-18 Thread Tommaso Teofili
2014-06-04 9:36 GMT+02:00 Thomas Mueller muel...@adobe.com:

 Hi,

 QueryIndex.getCost: this is actually quite well documented (see the
 Javadocs). But the implementations might not fully follow the contract :-)


this is probably just my opinion but the contract is not much clear; to me
finding the worst-case cost to query with the given filter defines what
should be calculated and the returned cost is a value between 1 (very
fast; lookup of a unique node) and the estimated number of entries to
traverse defines the output range as a function of the number of the,
estimated, number of results that the query would eventually return using
that index but, regardless of existing implementations, my doubt is what
this heuristic function to estimate the traversed entries should look
like in general, but that may be implementation specific, and how that
estimate should be used, should we just return the number of estimated
entries for the cost? So that cost is 1 when we estimate to return 1 entry,
2 when we estimate to return 2 entries, 1000 when we estimate to return
1000 entries?

My other concern on this point is that it's not granted, in my opinion,
that the index returning less entries would be the faster.


 But anyway, I think it's anyway the better to deprecate it and use
 AdvancedQueryIndex, as it has more features (specially important for
 ordered indexes).


it would be ok for me to either deprecate it or improve the semantics of
the cost calculation (e.g. explicitly introduce other metrics to be taken
into account in the cost calculation: local / remote index,


 Currently, both QueryIndex and AdvancedQueryIndex are
 supported, in the future I hope we can switch all index implementations to
 AdvancedQueryIndex. I didn't want to do that just before the 1.0 release
 however.


right, good point.



  get rid for example of the FullTextQueryIndex interface

 The FullTextQueryIndex allows to chose a full-text index for full-text
 queries, if one is available, even if the cost is higher. The problem is
 that only full-text indexes can return the correctly data, if full-text
 constraints are used. If we want to get rid of the FullTextQueryIndex
 interface, we need to address this problem in some other way.


since the FullTextQueryIndex seems to be a marker interface, could we
merge it some way into the AdvancedQueryIndex? Just to make things
simpler.



  should we always select the fastest index ? Especially for full text
 ones this should be in some way configurable.

 There are multiple aspects to this: (a) let the user decide which index
 to use, and (b) synchronous versus async indexes:

 (a) The user should be able to decide which index to use for a certain
 query. There are some problems with that: The index the user has in mind
 might not be available (in a certain environment, or with a later version
 of Oak, because for example the index implementation was replaced, or when
 not using Oak).


right, in this case we can fallback to the current mechanism where the
query engine decides which index to use


 Hardcoding the index to use (in the query) is problematic.
 Relational databases: Oracle supports hints (search for oracle database
 hint). SQLite supports something similar, but there it's actually an
 assertion that an index is available, not a hint:
 http://www.sqlite.org/lang_indexedby.html . My position is that we should
 avoid such a mechanism, and instead improve the query engine, the indexes
 implementations, and the documentation instead.


+1, I would not want to specify the index to be used in the (JCR) query,
the approach I was thinking too was to let the user to be able to give an
order of preference of the index to be used (at a repository level) which
would be respected in case those indexes exist and again that would
fallback to current selection mechanism in case of missing index(es).



 (b) Synchronous versus async indexes: some indexes are updated
 asynchronously, and therefore will not include the very latest additions
 to the content. (Recently removed nodes are not a problem, as the query
 engine will anyway have to check if a node is available). We could let the
 user decide if using an asynchronous index is OK or not.

 For both (a) and (b), one problem is that the JCR spec doesn't allow for
 extensions in the query syntax, so if the user would use select ...
 option async_ok, the query would not work with Jackrabbit 2.x and other
 JCR implementations. Maybe we should create a JCR commons utility method,
 so that one could use:

 QueryManager qm = ...

 Query q = JcrUtils.createQuery(qm, query, language, ASYNC_OK);


 The method JcrUtils.createQuery could then use instanceof to decide
 whether it's OK to modify the query (for Oak) or not (non-Oak).


yes, that's an option but if we could avoid having it in the query that
would be better.



  for full text queries for example, one may be interested in having a
 higher recall (more documents matching the query) 

Re: [DISCUSS] - QueryIndex selection

2014-06-18 Thread Davide Giannella
On 18/06/2014 10:26, Tommaso Teofili wrote:
 it would be ok for me to either deprecate it or improve the semantics
 of the cost calculation (e.g. explicitly introduce other metrics to be
 taken into account in the cost calculation: local / remote index, 
With the IndexPlan.isDelayed() we instruct the query engine if the
current index is asynchronous or not

http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isDelayed%28%29

 since the FullTextQueryIndex seems to be a marker interface, could we
 merge it some way into the AdvancedQueryIndex? Just to make things
 simpler.
I think we already have this aspect in the IndexPlan.isFullText()

http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isFulltextIndex%28%29

Cheers
Davide




Re: [DISCUSS] - QueryIndex selection

2014-06-18 Thread Thomas Mueller
Hi,

QueryIndex.getCost

my doubt is what
this heuristic function to estimate the traversed entries should look
like in general

Relational databases typically know the number of entries in the index
(total indexed entries), plus the selectivity of a column. See also
http://www.akadia.com/services/ora_index_selectivity.html . That way it's
quite easy to estimate the number of returned values, for a given filter.
I think Lucene and Solr have a way to get those numbers, but I didn't look
into that yet. I guess the problem might be, the cost method needs to be
very fast, otherwise all the queries would be affected. Relational
databases typically have those index statistics fully in memory.

Some databases additionally have a histogram that describes the
distribution of the values. See also
http://www.dba-oracle.com/t_histograms.htm

, but that may be implementation specific

The cost returned by an index should not be implementation specific,
otherwise the cost between different index implementations can't be
compared.

, and how that
estimate should be used, should we just return the number of estimated
entries for the cost? So that cost is 1 when we estimate to return 1
entry,
2 when we estimate to return 2 entries, 1000 when we estimate to return
1000 entries?

Generally yes, but if you keep everything in memory or know things are
cached, then you could return less (as documented).

My other concern on this point is that it's not granted, in my opinion,
that the index returning less entries would be the faster.

Yes, it's not that much about less entries or more entries, it's about
lower or higher cost. If there are more entries, but they are all in
memory, then that's better than less entries, but you need a disk read for
each entry. That's what the javadocs talk about.

it would be ok for me to either deprecate it or improve the semantics of
the cost calculation (e.g. explicitly introduce other metrics to be taken
into account in the cost calculation: local / remote index,

The QueryIndex will be deprated, sure. Until the AdvancedQueryIndex is
used, the documented behavior should be used: ...and if there could in
theory be one network roundtrip or disk read operation per node (this
method may return a lower number if the data is known to be fully in
memory).

since the FullTextQueryIndex seems to be a marker interface, could we
merge it some way into the AdvancedQueryIndex? Just to make things
simpler.

Yes. The IndexPlan of the AdvancedQueryIndex has a method
isFulltextIndex. Once we can switch to the AdvancedQueryIndex, the
FullTextQueryIndex is no longer needed.

In the end my very generic take on this discussion is that we should try
to
have a clearer API/documentation for the part that involves picking the
index via cost / index plans (but maybe that's only me), implementations
can be then adjusted accordingly if needed.

Yes, sure.

Regards,
Thomas



Re: [DISCUSS] - QueryIndex selection

2014-06-18 Thread Jukka Zitting
Hi,

On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:
 should we just return the number of estimated entries for the cost?

Yes, that's what I think the contract should be.

 My other concern on this point is that it's not granted, in my opinion,
 that the index returning less entries would be the faster.

The key concern here isn't the performance of the index, but the
overall performance of the query. If an index returns n paths for a
given query, then the amount of work done by the query engine will be
O(n) as it needs to look up each path and double-check the constraints
on those nodes (or O(n log n) if it also needs to sort the results).
Since I don't expect any reasonable index to perform worse than O(n),
returning n as the cost estimate seems like the best thing to do.

On a related note, I tend to disagree with the additional cost factors
introduced in AdvancedQueryIndex/IndexPlan. In practically all cases
the index lookups will be asymptotically faster than looking up all
the paths returned by the index. Thus I don't think there is value in
trying to make detailed estimates about the cost of the index lookup.

BR,

Jukka Zitting


Re: [DISCUSS] - QueryIndex selection

2014-06-18 Thread Jukka Zitting
Hi,

On Wed, Jun 18, 2014 at 7:44 AM, Thomas Mueller muel...@adobe.com wrote:
My other concern on this point is that it's not granted, in my opinion,
that the index returning less entries would be the faster.

 Yes, it's not that much about less entries or more entries, it's about
 lower or higher cost. If there are more entries, but they are all in
 memory, then that's better than less entries, but you need a disk read for
 each entry. That's what the javadocs talk about.

There's no way for a query index to know whether a particular node is
cached in memory or not.

BR,

Jukka Zitting


Re: [DISCUSS] - QueryIndex selection

2014-06-18 Thread Tommaso Teofili
ok, thanks Davide for the pointers.

Regards,
Tommaso


2014-06-18 13:36 GMT+02:00 Davide Giannella giannella.dav...@gmail.com:

 On 18/06/2014 10:26, Tommaso Teofili wrote:
  it would be ok for me to either deprecate it or improve the semantics
  of the cost calculation (e.g. explicitly introduce other metrics to be
  taken into account in the cost calculation: local / remote index,
 With the IndexPlan.isDelayed() we instruct the query engine if the
 current index is asynchronous or not


 http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isDelayed%28%29

  since the FullTextQueryIndex seems to be a marker interface, could we
  merge it some way into the AdvancedQueryIndex? Just to make things
  simpler.
 I think we already have this aspect in the IndexPlan.isFullText()


 http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isFulltextIndex%28%29

 Cheers
 Davide





Re: [DISCUSS] - QueryIndex selection

2014-06-18 Thread Tommaso Teofili
Hi,

2014-06-18 13:44 GMT+02:00 Thomas Mueller muel...@adobe.com:

 Hi,

 QueryIndex.getCost

 my doubt is what
 this heuristic function to estimate the traversed entries should look
 like in general

 Relational databases typically know the number of entries in the index
 (total indexed entries), plus the selectivity of a column. See also
 http://www.akadia.com/services/ora_index_selectivity.html . That way it's
 quite easy to estimate the number of returned values, for a given filter.
 I think Lucene and Solr have a way to get those numbers, but I didn't look
 into that yet. I guess the problem might be, the cost method needs to be
 very fast, otherwise all the queries would be affected. Relational
 databases typically have those index statistics fully in memory.


ok thanks for clarifying, yes we could get such numbers from Lucene and
Solr.



 Some databases additionally have a histogram that describes the
 distribution of the values. See also
 http://www.dba-oracle.com/t_histograms.htm

 , but that may be implementation specific

 The cost returned by an index should not be implementation specific,
 otherwise the cost between different index implementations can't be
 compared.


exactly, that's one of my main concerns when starting this discussion.
I wonder then if we should abstract the getCost method and just ask the
indexes to return the estimated entries and let the QueryEngine assign a
cost to each of them.



 , and how that
 estimate should be used, should we just return the number of estimated
 entries for the cost? So that cost is 1 when we estimate to return 1
 entry,
 2 when we estimate to return 2 entries, 1000 when we estimate to return
 1000 entries?

 Generally yes, but if you keep everything in memory or know things are
 cached, then you could return less (as documented).

 My other concern on this point is that it's not granted, in my opinion,
 that the index returning less entries would be the faster.

 Yes, it's not that much about less entries or more entries, it's about
 lower or higher cost. If there are more entries, but they are all in
 memory, then that's better than less entries, but you need a disk read for
 each entry. That's what the javadocs talk about.


yes, but I think the problem is that we don't have a defined function
that implements that, so how much should we smoothen the cost if the
entries are all in memory or viceversa increase the cost if there're
network roundtrips? That's what it seems to me to be implementation
specific and could lead to not comparable costs.



 it would be ok for me to either deprecate it or improve the semantics of
 the cost calculation (e.g. explicitly introduce other metrics to be taken
 into account in the cost calculation: local / remote index,

 The QueryIndex will be deprated, sure. Until the AdvancedQueryIndex is
 used, the documented behavior should be used: ...and if there could in
 theory be one network roundtrip or disk read operation per node (this
 method may return a lower number if the data is known to be fully in
 memory).

 since the FullTextQueryIndex seems to be a marker interface, could we
 merge it some way into the AdvancedQueryIndex? Just to make things
 simpler.

 Yes. The IndexPlan of the AdvancedQueryIndex has a method
 isFulltextIndex. Once we can switch to the AdvancedQueryIndex, the
 FullTextQueryIndex is no longer needed.


ok



 In the end my very generic take on this discussion is that we should try
 to
 have a clearer API/documentation for the part that involves picking the
 index via cost / index plans (but maybe that's only me), implementations
 can be then adjusted accordingly if needed.

 Yes, sure.


thanks for your comments Thomas.

Regards,
Tommaso



 Regards,
 Thomas




Re: [DISCUSS] - QueryIndex selection

2014-06-18 Thread Tommaso Teofili
Hi,

2014-06-18 16:02 GMT+02:00 Jukka Zitting jukka.zitt...@gmail.com:

 Hi,

 On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  should we just return the number of estimated entries for the cost?

 Yes, that's what I think the contract should be.


ok, that's different from what Thomas suggests, right? Just entry
estimates, no network roundtrips / asynchronous index penalties, etc.



  My other concern on this point is that it's not granted, in my opinion,
  that the index returning less entries would be the faster.

 The key concern here isn't the performance of the index, but the
 overall performance of the query. If an index returns n paths for a
 given query, then the amount of work done by the query engine will be
 O(n) as it needs to look up each path and double-check the constraints
 on those nodes (or O(n log n) if it also needs to sort the results).
 Since I don't expect any reasonable index to perform worse than O(n),
 returning n as the cost estimate seems like the best thing to do.


ok, under such perspective the index is not returning a cost, but how many
nodes it will provide to the engine, the cost of the query is then a
function of the number of entries.
At the moment node number estimates and performance of the index aspects
seem kind of merged into the getCost.
Then we should probably decouple (at least) the concepts of:
1. how many nodes the index will return for this query (as an estimate)
2. how fast in retrieving the estimated nodes the index is

Even with this distinction we would have to make some choices as given two
indices returning the same number of estimated nodes for the same query, (I
assume) the fastest should be chosen, but if two indices return two
different node number estimates (e.g. that's likely if you have two
different full text indices being able to handle the same query), which one
should be chosen and why?



 On a related note, I tend to disagree with the additional cost factors
 introduced in AdvancedQueryIndex/IndexPlan. In practically all cases
 the index lookups will be asymptotically faster than looking up all
 the paths returned by the index.


true, but then again the same question comes to my mind: how we select the
index to be used? Are we just interested in how many nodes it will return
or how much time it takes to do that plays some role in selection?


 Thus I don't think there is value in
 trying to make detailed estimates about the cost of the index lookup.


Thanks to all, I will probably need some more time to digest all the
thoughts and maybe I'll come back with more questions / ideas.
Regards,
Tommaso



 BR,

 Jukka Zitting



Re: [DISCUSS] - QueryIndex selection

2014-06-18 Thread Jukka Zitting
Hi,

On Wed, Jun 18, 2014 at 11:31 AM, Tommaso Teofili
tommaso.teof...@gmail.com wrote:
 2014-06-18 16:02 GMT+02:00 Jukka Zitting jukka.zitt...@gmail.com:
 On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  should we just return the number of estimated entries for the cost?

 Yes, that's what I think the contract should be.

 ok, that's different from what Thomas suggests, right? Just entry
 estimates, no network roundtrips / asynchronous index penalties, etc.

Right. I don't believe the cost of the index lookup is significant (at
least in the asymptotic sense) compared to the overall cost of
executing a query.

 ok, under such perspective the index is not returning a cost, but how many
 nodes it will provide to the engine, the cost of the query is then a
 function of the number of entries.

Exactly.

 At the moment node number estimates and performance of the index aspects
 seem kind of merged into the getCost.
 Then we should probably decouple (at least) the concepts of:
 1. how many nodes the index will return for this query (as an estimate)
 2. how fast in retrieving the estimated nodes the index is

I would further argue that point 2 is mostly irrelevant for any decent
index. The only case where I would expect index performance to show up
as a significant factor is when n is small, but the best way to
optimize such queries is probably to just cache the results per query
instead of trying to make informed guesses about expected index
performance.

 Even with this distinction we would have to make some choices as given two
 indices returning the same number of estimated nodes for the same query, (I
 assume) the fastest should be chosen, but if two indices return two
 different node number estimates (e.g. that's likely if you have two
 different full text indices being able to handle the same query), which one
 should be chosen and why?

Unless there are other contributing factors (like preferring a
synchronous index over an asynchronous one, or an explicit preference
by a client), it shouldn't really matter much which one of equally
costly indexes is being selected.

BR,

Jukka Zitting