Re: [DISCUSS] - QueryIndex selection
2014-06-04 9:36 GMT+02:00 Thomas Mueller muel...@adobe.com: Hi, QueryIndex.getCost: this is actually quite well documented (see the Javadocs). But the implementations might not fully follow the contract :-) this is probably just my opinion but the contract is not much clear; to me finding the worst-case cost to query with the given filter defines what should be calculated and the returned cost is a value between 1 (very fast; lookup of a unique node) and the estimated number of entries to traverse defines the output range as a function of the number of the, estimated, number of results that the query would eventually return using that index but, regardless of existing implementations, my doubt is what this heuristic function to estimate the traversed entries should look like in general, but that may be implementation specific, and how that estimate should be used, should we just return the number of estimated entries for the cost? So that cost is 1 when we estimate to return 1 entry, 2 when we estimate to return 2 entries, 1000 when we estimate to return 1000 entries? My other concern on this point is that it's not granted, in my opinion, that the index returning less entries would be the faster. But anyway, I think it's anyway the better to deprecate it and use AdvancedQueryIndex, as it has more features (specially important for ordered indexes). it would be ok for me to either deprecate it or improve the semantics of the cost calculation (e.g. explicitly introduce other metrics to be taken into account in the cost calculation: local / remote index, Currently, both QueryIndex and AdvancedQueryIndex are supported, in the future I hope we can switch all index implementations to AdvancedQueryIndex. I didn't want to do that just before the 1.0 release however. right, good point. get rid for example of the FullTextQueryIndex interface The FullTextQueryIndex allows to chose a full-text index for full-text queries, if one is available, even if the cost is higher. The problem is that only full-text indexes can return the correctly data, if full-text constraints are used. If we want to get rid of the FullTextQueryIndex interface, we need to address this problem in some other way. since the FullTextQueryIndex seems to be a marker interface, could we merge it some way into the AdvancedQueryIndex? Just to make things simpler. should we always select the fastest index ? Especially for full text ones this should be in some way configurable. There are multiple aspects to this: (a) let the user decide which index to use, and (b) synchronous versus async indexes: (a) The user should be able to decide which index to use for a certain query. There are some problems with that: The index the user has in mind might not be available (in a certain environment, or with a later version of Oak, because for example the index implementation was replaced, or when not using Oak). right, in this case we can fallback to the current mechanism where the query engine decides which index to use Hardcoding the index to use (in the query) is problematic. Relational databases: Oracle supports hints (search for oracle database hint). SQLite supports something similar, but there it's actually an assertion that an index is available, not a hint: http://www.sqlite.org/lang_indexedby.html . My position is that we should avoid such a mechanism, and instead improve the query engine, the indexes implementations, and the documentation instead. +1, I would not want to specify the index to be used in the (JCR) query, the approach I was thinking too was to let the user to be able to give an order of preference of the index to be used (at a repository level) which would be respected in case those indexes exist and again that would fallback to current selection mechanism in case of missing index(es). (b) Synchronous versus async indexes: some indexes are updated asynchronously, and therefore will not include the very latest additions to the content. (Recently removed nodes are not a problem, as the query engine will anyway have to check if a node is available). We could let the user decide if using an asynchronous index is OK or not. For both (a) and (b), one problem is that the JCR spec doesn't allow for extensions in the query syntax, so if the user would use select ... option async_ok, the query would not work with Jackrabbit 2.x and other JCR implementations. Maybe we should create a JCR commons utility method, so that one could use: QueryManager qm = ... Query q = JcrUtils.createQuery(qm, query, language, ASYNC_OK); The method JcrUtils.createQuery could then use instanceof to decide whether it's OK to modify the query (for Oak) or not (non-Oak). yes, that's an option but if we could avoid having it in the query that would be better. for full text queries for example, one may be interested in having a higher recall (more documents matching the query)
Re: [DISCUSS] - QueryIndex selection
On 18/06/2014 10:26, Tommaso Teofili wrote: it would be ok for me to either deprecate it or improve the semantics of the cost calculation (e.g. explicitly introduce other metrics to be taken into account in the cost calculation: local / remote index, With the IndexPlan.isDelayed() we instruct the query engine if the current index is asynchronous or not http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isDelayed%28%29 since the FullTextQueryIndex seems to be a marker interface, could we merge it some way into the AdvancedQueryIndex? Just to make things simpler. I think we already have this aspect in the IndexPlan.isFullText() http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isFulltextIndex%28%29 Cheers Davide
Re: [DISCUSS] - QueryIndex selection
Hi, QueryIndex.getCost my doubt is what this heuristic function to estimate the traversed entries should look like in general Relational databases typically know the number of entries in the index (total indexed entries), plus the selectivity of a column. See also http://www.akadia.com/services/ora_index_selectivity.html . That way it's quite easy to estimate the number of returned values, for a given filter. I think Lucene and Solr have a way to get those numbers, but I didn't look into that yet. I guess the problem might be, the cost method needs to be very fast, otherwise all the queries would be affected. Relational databases typically have those index statistics fully in memory. Some databases additionally have a histogram that describes the distribution of the values. See also http://www.dba-oracle.com/t_histograms.htm , but that may be implementation specific The cost returned by an index should not be implementation specific, otherwise the cost between different index implementations can't be compared. , and how that estimate should be used, should we just return the number of estimated entries for the cost? So that cost is 1 when we estimate to return 1 entry, 2 when we estimate to return 2 entries, 1000 when we estimate to return 1000 entries? Generally yes, but if you keep everything in memory or know things are cached, then you could return less (as documented). My other concern on this point is that it's not granted, in my opinion, that the index returning less entries would be the faster. Yes, it's not that much about less entries or more entries, it's about lower or higher cost. If there are more entries, but they are all in memory, then that's better than less entries, but you need a disk read for each entry. That's what the javadocs talk about. it would be ok for me to either deprecate it or improve the semantics of the cost calculation (e.g. explicitly introduce other metrics to be taken into account in the cost calculation: local / remote index, The QueryIndex will be deprated, sure. Until the AdvancedQueryIndex is used, the documented behavior should be used: ...and if there could in theory be one network roundtrip or disk read operation per node (this method may return a lower number if the data is known to be fully in memory). since the FullTextQueryIndex seems to be a marker interface, could we merge it some way into the AdvancedQueryIndex? Just to make things simpler. Yes. The IndexPlan of the AdvancedQueryIndex has a method isFulltextIndex. Once we can switch to the AdvancedQueryIndex, the FullTextQueryIndex is no longer needed. In the end my very generic take on this discussion is that we should try to have a clearer API/documentation for the part that involves picking the index via cost / index plans (but maybe that's only me), implementations can be then adjusted accordingly if needed. Yes, sure. Regards, Thomas
Re: [DISCUSS] - QueryIndex selection
Hi, On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: should we just return the number of estimated entries for the cost? Yes, that's what I think the contract should be. My other concern on this point is that it's not granted, in my opinion, that the index returning less entries would be the faster. The key concern here isn't the performance of the index, but the overall performance of the query. If an index returns n paths for a given query, then the amount of work done by the query engine will be O(n) as it needs to look up each path and double-check the constraints on those nodes (or O(n log n) if it also needs to sort the results). Since I don't expect any reasonable index to perform worse than O(n), returning n as the cost estimate seems like the best thing to do. On a related note, I tend to disagree with the additional cost factors introduced in AdvancedQueryIndex/IndexPlan. In practically all cases the index lookups will be asymptotically faster than looking up all the paths returned by the index. Thus I don't think there is value in trying to make detailed estimates about the cost of the index lookup. BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, On Wed, Jun 18, 2014 at 7:44 AM, Thomas Mueller muel...@adobe.com wrote: My other concern on this point is that it's not granted, in my opinion, that the index returning less entries would be the faster. Yes, it's not that much about less entries or more entries, it's about lower or higher cost. If there are more entries, but they are all in memory, then that's better than less entries, but you need a disk read for each entry. That's what the javadocs talk about. There's no way for a query index to know whether a particular node is cached in memory or not. BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
ok, thanks Davide for the pointers. Regards, Tommaso 2014-06-18 13:36 GMT+02:00 Davide Giannella giannella.dav...@gmail.com: On 18/06/2014 10:26, Tommaso Teofili wrote: it would be ok for me to either deprecate it or improve the semantics of the cost calculation (e.g. explicitly introduce other metrics to be taken into account in the cost calculation: local / remote index, With the IndexPlan.isDelayed() we instruct the query engine if the current index is asynchronous or not http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isDelayed%28%29 since the FullTextQueryIndex seems to be a marker interface, could we merge it some way into the AdvancedQueryIndex? Just to make things simpler. I think we already have this aspect in the IndexPlan.isFullText() http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isFulltextIndex%28%29 Cheers Davide
Re: [DISCUSS] - QueryIndex selection
Hi, 2014-06-18 13:44 GMT+02:00 Thomas Mueller muel...@adobe.com: Hi, QueryIndex.getCost my doubt is what this heuristic function to estimate the traversed entries should look like in general Relational databases typically know the number of entries in the index (total indexed entries), plus the selectivity of a column. See also http://www.akadia.com/services/ora_index_selectivity.html . That way it's quite easy to estimate the number of returned values, for a given filter. I think Lucene and Solr have a way to get those numbers, but I didn't look into that yet. I guess the problem might be, the cost method needs to be very fast, otherwise all the queries would be affected. Relational databases typically have those index statistics fully in memory. ok thanks for clarifying, yes we could get such numbers from Lucene and Solr. Some databases additionally have a histogram that describes the distribution of the values. See also http://www.dba-oracle.com/t_histograms.htm , but that may be implementation specific The cost returned by an index should not be implementation specific, otherwise the cost between different index implementations can't be compared. exactly, that's one of my main concerns when starting this discussion. I wonder then if we should abstract the getCost method and just ask the indexes to return the estimated entries and let the QueryEngine assign a cost to each of them. , and how that estimate should be used, should we just return the number of estimated entries for the cost? So that cost is 1 when we estimate to return 1 entry, 2 when we estimate to return 2 entries, 1000 when we estimate to return 1000 entries? Generally yes, but if you keep everything in memory or know things are cached, then you could return less (as documented). My other concern on this point is that it's not granted, in my opinion, that the index returning less entries would be the faster. Yes, it's not that much about less entries or more entries, it's about lower or higher cost. If there are more entries, but they are all in memory, then that's better than less entries, but you need a disk read for each entry. That's what the javadocs talk about. yes, but I think the problem is that we don't have a defined function that implements that, so how much should we smoothen the cost if the entries are all in memory or viceversa increase the cost if there're network roundtrips? That's what it seems to me to be implementation specific and could lead to not comparable costs. it would be ok for me to either deprecate it or improve the semantics of the cost calculation (e.g. explicitly introduce other metrics to be taken into account in the cost calculation: local / remote index, The QueryIndex will be deprated, sure. Until the AdvancedQueryIndex is used, the documented behavior should be used: ...and if there could in theory be one network roundtrip or disk read operation per node (this method may return a lower number if the data is known to be fully in memory). since the FullTextQueryIndex seems to be a marker interface, could we merge it some way into the AdvancedQueryIndex? Just to make things simpler. Yes. The IndexPlan of the AdvancedQueryIndex has a method isFulltextIndex. Once we can switch to the AdvancedQueryIndex, the FullTextQueryIndex is no longer needed. ok In the end my very generic take on this discussion is that we should try to have a clearer API/documentation for the part that involves picking the index via cost / index plans (but maybe that's only me), implementations can be then adjusted accordingly if needed. Yes, sure. thanks for your comments Thomas. Regards, Tommaso Regards, Thomas
Re: [DISCUSS] - QueryIndex selection
Hi, 2014-06-18 16:02 GMT+02:00 Jukka Zitting jukka.zitt...@gmail.com: Hi, On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: should we just return the number of estimated entries for the cost? Yes, that's what I think the contract should be. ok, that's different from what Thomas suggests, right? Just entry estimates, no network roundtrips / asynchronous index penalties, etc. My other concern on this point is that it's not granted, in my opinion, that the index returning less entries would be the faster. The key concern here isn't the performance of the index, but the overall performance of the query. If an index returns n paths for a given query, then the amount of work done by the query engine will be O(n) as it needs to look up each path and double-check the constraints on those nodes (or O(n log n) if it also needs to sort the results). Since I don't expect any reasonable index to perform worse than O(n), returning n as the cost estimate seems like the best thing to do. ok, under such perspective the index is not returning a cost, but how many nodes it will provide to the engine, the cost of the query is then a function of the number of entries. At the moment node number estimates and performance of the index aspects seem kind of merged into the getCost. Then we should probably decouple (at least) the concepts of: 1. how many nodes the index will return for this query (as an estimate) 2. how fast in retrieving the estimated nodes the index is Even with this distinction we would have to make some choices as given two indices returning the same number of estimated nodes for the same query, (I assume) the fastest should be chosen, but if two indices return two different node number estimates (e.g. that's likely if you have two different full text indices being able to handle the same query), which one should be chosen and why? On a related note, I tend to disagree with the additional cost factors introduced in AdvancedQueryIndex/IndexPlan. In practically all cases the index lookups will be asymptotically faster than looking up all the paths returned by the index. true, but then again the same question comes to my mind: how we select the index to be used? Are we just interested in how many nodes it will return or how much time it takes to do that plays some role in selection? Thus I don't think there is value in trying to make detailed estimates about the cost of the index lookup. Thanks to all, I will probably need some more time to digest all the thoughts and maybe I'll come back with more questions / ideas. Regards, Tommaso BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, On Wed, Jun 18, 2014 at 11:31 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: 2014-06-18 16:02 GMT+02:00 Jukka Zitting jukka.zitt...@gmail.com: On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote: should we just return the number of estimated entries for the cost? Yes, that's what I think the contract should be. ok, that's different from what Thomas suggests, right? Just entry estimates, no network roundtrips / asynchronous index penalties, etc. Right. I don't believe the cost of the index lookup is significant (at least in the asymptotic sense) compared to the overall cost of executing a query. ok, under such perspective the index is not returning a cost, but how many nodes it will provide to the engine, the cost of the query is then a function of the number of entries. Exactly. At the moment node number estimates and performance of the index aspects seem kind of merged into the getCost. Then we should probably decouple (at least) the concepts of: 1. how many nodes the index will return for this query (as an estimate) 2. how fast in retrieving the estimated nodes the index is I would further argue that point 2 is mostly irrelevant for any decent index. The only case where I would expect index performance to show up as a significant factor is when n is small, but the best way to optimize such queries is probably to just cache the results per query instead of trying to make informed guesses about expected index performance. Even with this distinction we would have to make some choices as given two indices returning the same number of estimated nodes for the same query, (I assume) the fastest should be chosen, but if two indices return two different node number estimates (e.g. that's likely if you have two different full text indices being able to handle the same query), which one should be chosen and why? Unless there are other contributing factors (like preferring a synchronous index over an asynchronous one, or an explicit preference by a client), it shouldn't really matter much which one of equally costly indexes is being selected. BR, Jukka Zitting