Re: [DISCUSS] - QueryIndex selection
Hi, I looked a bit into how MongoDB selects indexes (query plans) and think we could take some inspiration. So, the way MongoDB does it afaiu: * query gets parsed into Abstract Syntax Tree (so that parameters can get stripped out) * the first time this query is performed then the query is executed against *all* available indexes * the fastest index is put into a cache, so that when the same query (abstracted, regardless of parameters) comes in, then only that fastest index will be used (will be looked up from cache) * after a number of modifications that index-selection-cache is flushed. Process starts at beginning. What I dislike about this process is that the first query puts a lot more into the system (due to the fact that all indexes perform the query). Moreover, the first execution of that query could be disturbed by noise, so the selection could be wrong. What I like, though, (if we ignore the noise issue from above) is that the selected index is the one that has actually proven to be the fastest. So, for Oak: maybe we could enhance the deterministic selection process we have right now. We could run queries in the background to determine if the cost factors that the indexers claim to have are actually correct (and if not, correct them in the query engine). Those background queries could be the ones “most often executed” by users on that repo that have multiple indexes capable of answering the query. Consider such a scenario: you have the same nodes indexed in the local property index (on the same machine that also serves requests) and a remote SolrCloud cluster. If we only reason about index size etc then we can never account for the fact that the local machine’s index might be much slower than those external machines that are used exclusively for answering queries. We could though, if we actually run those queries a number of times on both indexes. Cheers Michael
Re: [DISCUSS] - QueryIndex selection
Hi, On Thu, Jun 26, 2014 at 2:55 AM, Davide Giannella wrote: > Can't we do the ACL check lazily? Instead of the query engine looping > through the nodes and check, if there's no need of doing so already (IE > sorting), why not returning the set and then filter out the ACLs while > the user load the node using the NodeIterator? As Thomas said, we already do that. Since we want to estimate full cost of executing a query, including the iteration over the query results, we need to take also those costs into account when making a decision on which index to use. BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, On Thu, Jun 26, 2014 at 4:10 AM, Angela Schreiber wrote: > however, please be aware that one key feature of oak (compared to > jackrabbit which only allowed permission evaluation by path) is that > it always needs to be clear if the target for the permission evaluation > is a node or a property. similarly, the restrictions may require the > item being retrieved in order to perform the required evaluation (e.g. > restricting access by node type). Agreed. Here with the covered index idea we however have a valid use case for considering changes to the internal access control API. Such a change would indeed make it harder to implement some access control features (like type-dependent restrictions), so we'd need to carefully weigh the benefits and drawbacks of such a change before implementing it. For now I tend to agree with you that we shouldn't change anything here, but I've been wrong before so I think it would be a good idea to keep our options open here. BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, >Can't we do the ACL check lazily? That's what we do right now. Regards, Thomas
Re: [DISCUSS] - QueryIndex selection
hi jukka this is not quite true. as i will explain below. first i would strongly recommend not to rely on the current implementation. if we have the requirement to evaluated permissions based on the path we may extend the permissionprovider which IMO is the key API for these cases; not the treepermissions. however, please be aware that one key feature of oak (compared to jackrabbit which only allowed permission evaluation by path) is that it always needs to be clear if the target for the permission evaluation is a node or a property. similarly, the restrictions may require the item being retrieved in order to perform the required evaluation (e.g. restricting access by node type). also the evaluation is obligated to respect the different types of items when it comes to read access. in other words: extending the the permissionprovider with something like isGranted(@Nonnull String treePath, @Nullable String propertyName, long permissions). will just be an optimization for those cases where full access is granted everywhere which can equally be determined by calling TreePermission#canReadAll on the root node. so, i would recommend to first explore if the method would not just serve the purpose. kind regards angela On 25/06/14 16:48, "Jukka Zitting" wrote: >Hi, > >On Wed, Jun 25, 2014 at 10:16 AM, Thomas Mueller >wrote: >> Yes, we would need to use a different access control API. The ability to >> check whether a session has access to a path/node/property, without >> actually loading the node from the storage backend. Maybe that API is >> already there? > >The TreePermission interface is the key API here, and the way we've >designed it requires loading the nodes being accessed (see the >getChildPermission method). The current implementation of the API >actually *doesn't* strictly require the loading of the underlying >nodes, so we might be able to refactor the API to better accommodate >such indexes. > >BR, > >Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
On 25/06/2014 16:48, Jukka Zitting wrote: > The TreePermission interface is the key API here, and the way we've > designed it requires loading the nodes being accessed (see the > getChildPermission method). The current implementation of the API > actually *doesn't* strictly require the loading of the underlying > nodes, so we might be able to refactor the API to better accommodate > such indexes. > Never looked into the details so I'm most probably wrong. Can't we do the ACL check lazily? Instead of the query engine looping through the nodes and check, if there's no need of doing so already (IE sorting), why not returning the set and then filter out the ACLs while the user load the node using the NodeIterator? D.
Re: [DISCUSS] - QueryIndex selection
Hi, On Wed, Jun 25, 2014 at 10:16 AM, Thomas Mueller wrote: > Yes, we would need to use a different access control API. The ability to > check whether a session has access to a path/node/property, without > actually loading the node from the storage backend. Maybe that API is > already there? The TreePermission interface is the key API here, and the way we've designed it requires loading the nodes being accessed (see the getChildPermission method). The current implementation of the API actually *doesn't* strictly require the loading of the underlying nodes, so we might be able to refactor the API to better accommodate such indexes. BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, >But getting >to that point may be a bit tricky, especially because of access >control. Yes, we would need to use a different access control API. The ability to check whether a session has access to a path/node/property, without actually loading the node from the storage backend. Maybe that API is already there? Regards, Thomas
Re: [DISCUSS] - QueryIndex selection
Hi, On Mon, Jun 23, 2014 at 4:23 PM, Thomas Mueller wrote: > Sorry, sure, the condition is verified again. But this might be an > in-memory operation. The index may return the property value for each > entry as part of running the query (QueryIndex - Cursor - IndexRow). I > think the index implementations don't do that currently, so in reality the > node is always loaded with the current version of Oak, that's true. > Javadoc of Cursor.next(): "The index should return a row with those > properties that are stored in the index itself, so that the query engine > doesn't have to load the whole row / node unnecessarily." Ah, I see where you're going! It would indeed be nice if the index could also provide the data to back more than the virtual properties (jcr:score, etc.) it now does. Then a query like "SELECT a, b, c FROM ..." with an index that covers a, b and c could avoid the (1 + ...) term from OAK-1910. But getting to that point may be a bit tricky, especially because of access control. BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, >>>It's more than access control. The query engine needs to double-check >>>the constraints of the query for each matching path before passing >>>that node to the client (see the constraint.evaluate() call in [1]). I >>>don't see any easy way to avoid that step without major refactoring. >> >> If there is no other constraint, then no additional checks are needed. > >That's not correct. For example if I query for a value with more than >100 characters, the PropertyIndex may return paths that actually won't >match the query. Sorry, sure, the condition is verified again. But this might be an in-memory operation. The index may return the property value for each entry as part of running the query (QueryIndex - Cursor - IndexRow). I think the index implementations don't do that currently, so in reality the node is always loaded with the current version of Oak, that's true. Javadoc of Cursor.next(): "The index should return a row with those properties that are stored in the index itself, so that the query engine doesn't have to load the whole row / node unnecessarily." Regards, Thomas
Re: [DISCUSS] - QueryIndex selection
Hi, On Mon, Jun 23, 2014 at 1:58 PM, Thomas Mueller wrote: >>It's more than access control. The query engine needs to double-check >>the constraints of the query for each matching path before passing >>that node to the client (see the constraint.evaluate() call in [1]). I >>don't see any easy way to avoid that step without major refactoring. > > If there is no other constraint, then no additional checks are needed. That's not correct. For example if I query for a value with more than 100 characters, the PropertyIndex may return paths that actually won't match the query. This requirement to double-check the results is also described in the QueryIndex.guery() contract: "An implementation should only filter the result if it can do so easily and efficiently; the query engine will verify the data again (in memory) and check for access rights." >>Are there any potential indexes where the >>AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now >>used in [2]) should return a value that's different from 1? > > Yes, for example an index that keep all (relevant) entries in memory, the > cost should be close to zero. Makes sense if we adapt the calculation as suggested in OAK-1910. > Yes. The AdvancedQueryIndex has separate methods so that the query engine > can better calculate the cost (getCostPerExecution, getCostPerEntry, > getEstimatedEntryCount). The query could contain a limit (let's say 100) > which should also be taken into account, and possibly an "order by" > restriction. Plus the query engine could take into account that typically, > only the first 50 entries are read (optimize for "fast first 50 entries" - > see also > http://stackoverflow.com/questions/1308946/should-i-use-query-hint-fast-num > ber-rows-fastfirstrow ). Agreed. >>The index-level entry cost estimates only become relevant when the >>cost of returning a path is more than a fraction of the cost of >>loading a node. I don't believe that's the case for any reasonable >>index implementations. > > It depends on whether (and when) the node needs to be loaded. The node always needs to be loaded, see above. >>I'm just worried about potential confusion about what the >>getCostPerEntry() method (as used in [2]) should return. The value is >>currently only set in [3], but there the estimate seems to be based on >>the relative performance of the *index lookup*, not the overall >>performance of a query. I believe either [2] or [3] should be adjusted >>to fix the cost calculation. > > Yes, you are right. Currently the formula assumes that the query engine > doesn't load the node. That's not correct. I created OAK-1910 to track > this. Thanks! BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, >It's more than access control. The query engine needs to double-check >the constraints of the query for each matching path before passing >that node to the client (see the constraint.evaluate() call in [1]). I >don't see any easy way to avoid that step without major refactoring. If there is no other constraint, then no additional checks are needed. >Are there any potential indexes where the >AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now >used in [2]) should return a value that's different from 1? Yes, for example an index that keep all (relevant) entries in memory, the cost should be close to zero. >Say I have 10k nodes distributed uniformly across two sizes and ten >colors. Querying for "size=L and color=red" would return 5000 paths >when using the size index, or 1000 paths when using the color index (a >multi-property index would return only the 500 exact matches). If the >size index is really fast and returns those 5000 paths in just 10ms >whereas the color index takes 100ms to return the 1000 matching paths, >it'll still be much slower to use the size index for the query if >accessing a single node takes say 1ms on average. Yes. The AdvancedQueryIndex has separate methods so that the query engine can better calculate the cost (getCostPerExecution, getCostPerEntry, getEstimatedEntryCount). The query could contain a limit (let's say 100) which should also be taken into account, and possibly an "order by" restriction. Plus the query engine could take into account that typically, only the first 50 entries are read (optimize for "fast first 50 entries" - see also http://stackoverflow.com/questions/1308946/should-i-use-query-hint-fast-num ber-rows-fastfirstrow ). >The index-level entry cost estimates only become relevant when the >cost of returning a path is more than a fraction of the cost of >loading a node. I don't believe that's the case for any reasonable >index implementations. It depends on whether (and when) the node needs to be loaded. >I'm just worried about potential confusion about what the >getCostPerEntry() method (as used in [2]) should return. The value is >currently only set in [3], but there the estimate seems to be based on >the relative performance of the *index lookup*, not the overall >performance of a query. I believe either [2] or [3] should be adjusted >to fix the cost calculation. Yes, you are right. Currently the formula assumes that the query engine doesn't load the node. That's not correct. I created OAK-1910 to track this. Regards, Thomas
Re: [DISCUSS] - QueryIndex selection
Hi, On Mon, Jun 23, 2014 at 11:18 AM, Thomas Mueller wrote: >>Sure, but we don't use a covered index. > > Yes, we are not there yet. The node is currently loaded to check access > rights, but that's an implementation detail of access control part. And > it's not needed for the admin. If (when) this is changed, it depends on > the query. If the query only returns the path (for example), and the > indexed property, then the node does not need to be loaded. It's more than access control. The query engine needs to double-check the constraints of the query for each matching path before passing that node to the client (see the constraint.evaluate() call in [1]). I don't see any easy way to avoid that step without major refactoring. >> At least in the current design >>the query engine will always load all the matching nodes, regardless >>of any extra information stored in the index. Thus we can't use the >>performance of the index lookup as an accurate estimate of the overall >>query performance. > > Yes, the query engine knows the cost overhead per entry, that's why the > AdvancedQueryIndex interface is a bit different (does not just contain the > cost). Are there any potential indexes where the AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now used in [2]) should return a value that's different from 1? >>The overhead of the index lookup is probably significant when only few >>matching paths are returned (the UUID index would be the ultimate >>example), but in those cases query performance is probably best >>optimized in other ways (caching, configuration, profiling, etc.) than >>making a more accurate estimate of the index lookup performance, >>especially since in most such cases there is only a single index with >>exact matches. > > We have queries that have multiple property constraints (for example "size > = 'L' and color = 'red'"). If there are two indexes, one for size and one > for color, it's important to know which one is faster. (If there is a > multi-property index, that one might be faster.) Say I have 10k nodes distributed uniformly across two sizes and ten colors. Querying for "size=L and color=red" would return 5000 paths when using the size index, or 1000 paths when using the color index (a multi-property index would return only the 500 exact matches). If the size index is really fast and returns those 5000 paths in just 10ms whereas the color index takes 100ms to return the 1000 matching paths, it'll still be much slower to use the size index for the query if accessing a single node takes say 1ms on average. The index-level entry cost estimates only become relevant when the cost of returning a path is more than a fraction of the cost of loading a node. I don't believe that's the case for any reasonable index implementations. >>Agreed, but we should be making guesses about the overall query >>performance, not just the index lookup time. > > Yes, but the index implementation doesn't know (can't know) the cost of > the query engine. The query engine needs to calculate its own cost. Agreed. I'm just worried about potential confusion about what the getCostPerEntry() method (as used in [2]) should return. The value is currently only set in [3], but there the estimate seems to be based on the relative performance of the *index lookup*, not the overall performance of a query. I believe either [2] or [3] should be adjusted to fix the cost calculation. [1] https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/query/QueryImpl.java#L628 [2] https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/query/QueryImpl.java#L810 [3] https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/property/OrderedPropertyIndex.java#L73 BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, >The problem with that assumption is that typically a single disk read >to the index would return n paths, whereas loading those n nodes might >well take n more disk reads. Ideally, the cost returned of the index would reflect that. For single-property indexes (all property indexes are single property right now), the cost should be lower than for a multi-property index, because one disk read operation would fetch more paths. >Sure, but we don't use a covered index. Yes, we are not there yet. The node is currently loaded to check access rights, but that's an implementation detail of access control part. And it's not needed for the admin. If (when) this is changed, it depends on the query. If the query only returns the path (for example), and the indexed property, then the node does not need to be loaded. > At least in the current design >the query engine will always load all the matching nodes, regardless >of any extra information stored in the index. Thus we can't use the >performance of the index lookup as an accurate estimate of the overall >query performance. Yes, the query engine knows the cost overhead per entry, that's why the AdvancedQueryIndex interface is a bit different (does not just contain the cost). >The overhead of the index lookup is probably significant when only few >matching paths are returned (the UUID index would be the ultimate >example), but in those cases query performance is probably best >optimized in other ways (caching, configuration, profiling, etc.) than >making a more accurate estimate of the index lookup performance, >especially since in most such cases there is only a single index with >exact matches. We have queries that have multiple property constraints (for example "size = 'L' and color = 'red'"). If there are two indexes, one for size and one for color, it's important to know which one is faster. (If there is a multi-property index, that one might be faster.) >Agreed, but we should be making guesses about the overall query >performance, not just the index lookup time. Yes, but the index implementation doesn't know (can't know) the cost of the query engine. The query engine needs to calculate its own cost. Regards, Thomas
Re: [DISCUSS] - QueryIndex selection
Hi, On Mon, Jun 23, 2014 at 3:30 AM, Thomas Mueller wrote: >>Right. I don't believe the cost of the index lookup is significant (at >>least in the asymptotic sense) compared to the overall cost of >>executing a query. > > Sorry, I don't understand. The cost of the index lookup *is* significant > of course, specially if the nodes are not in the cache (which is very > common). In which case, index lookup can contribute to about 50% of the > overall cost of a query (if, let's assume, one disk read is needed for the > index lookup, and one disk read is needed to read the actual node). The problem with that assumption is that typically a single disk read to the index would return n paths, whereas loading those n nodes might well take n more disk reads. Similarly for the Solr case you mention: There is some overhead in the network roundtrip, but the search server will typically reply with the list of all matching paths in a single roundtrip, whereas then loading those matching nodes will (at least with our current code) require O(n) network roundtrips or disk reads. > For a covering index - > http://stackoverflow.com/questions/62137/what-is-a-covered-index - the > index lookup is nearly 100% of the query cost. Sure, but we don't use a covered index. At least in the current design the query engine will always load all the matching nodes, regardless of any extra information stored in the index. Thus we can't use the performance of the index lookup as an accurate estimate of the overall query performance. >> in the asymptotic sense > > Sorry could you translate that for me please? :-) In the Big-O sense. The overhead of the index lookup is probably significant when only few matching paths are returned (the UUID index would be the ultimate example), but in those cases query performance is probably best optimized in other ways (caching, configuration, profiling, etc.) than making a more accurate estimate of the index lookup performance, especially since in most such cases there is only a single index with exact matches. When the number of matching paths becomes large, say in the 1k-1M range, the relative performance overhead of different query indexes becomes less of an issue. For example, say index A returns 10k matching paths in 100ms and index B returns 20k less accurately matching paths in just 10ms. If accessing each node takes 1ms on average, the significantly faster performance of index B is totally irrelevant to the overall time it takes to execute the full query. > Trying to make an informed guess about the cost is the whole point of a > cost based optimizer - http://en.wikipedia.org/wiki/Query_optimization Agreed, but we should be making guesses about the overall query performance, not just the index lookup time. BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, >should we just return the number of estimated entries for the cost? For Lucene, the property index, the ordered index, and the node type index: yes. For Solr, the cost per index lookup (not per entry) is probably a bit higher, because there is a network round trip. Specially if Solr is remote. That's a fixed offset, so the cost could be: 1 + estimatedEntryCount. For in-memory indexes (that keep all entries in memory), no. One example is an in-memory index of transient UUIDs. In this case it might depend on whether the UUID is in memory or not. Calculating the cost in this case is quick, as it's just an in-memory lookup. Therefore, the cost could be very close to 0. >Right. I don't believe the cost of the index lookup is significant (at >least in the asymptotic sense) compared to the overall cost of >executing a query. Sorry, I don't understand. The cost of the index lookup *is* significant of course, specially if the nodes are not in the cache (which is very common). In which case, index lookup can contribute to about 50% of the overall cost of a query (if, let's assume, one disk read is needed for the index lookup, and one disk read is needed to read the actual node). For a covering index - http://stackoverflow.com/questions/62137/what-is-a-covered-index - the index lookup is nearly 100% of the query cost. > in the asymptotic sense Sorry could you translate that for me please? :-) >>ok, under such perspective the index is not returning a cost, but how >>many >> nodes it will provide to the engine, the cost of the query is then a >> function of the number of entries. > >Exactly. As I wrote above, it depends on the nature of the index (in-memory, disk based, remote or local). >instead of trying to make informed guesses about expected index >performance. Trying to make an informed guess about the cost is the whole point of a cost based optimizer - http://en.wikipedia.org/wiki/Query_optimization >it shouldn't really matter much which one of equally >costly indexes is being selected. Sure, if the cost of two indexes is the same, then it doesn't matter which index is used. That's the reason to make sure the returned cost is somewhat accurate, and the reason to use the same scale (units of measuring) in all index implementations. Regards, Thomas
Re: [DISCUSS] - QueryIndex selection
Hi, On Wed, Jun 18, 2014 at 11:31 AM, Tommaso Teofili wrote: > 2014-06-18 16:02 GMT+02:00 Jukka Zitting : >> On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili >> wrote: >> > should we just return the number of estimated entries for the cost? >> >> Yes, that's what I think the contract should be. > > ok, that's different from what Thomas suggests, right? Just entry > estimates, no network roundtrips / asynchronous index penalties, etc. Right. I don't believe the cost of the index lookup is significant (at least in the asymptotic sense) compared to the overall cost of executing a query. > ok, under such perspective the index is not returning a cost, but how many > nodes it will provide to the engine, the cost of the query is then a > function of the number of entries. Exactly. > At the moment node number estimates and performance of the index aspects > seem kind of merged into the "getCost". > Then we should probably decouple (at least) the concepts of: > 1. how many nodes the index will return for this query (as an estimate) > 2. how fast in retrieving the estimated nodes the index is I would further argue that point 2 is mostly irrelevant for any decent index. The only case where I would expect index performance to show up as a significant factor is when n is small, but the best way to optimize such queries is probably to just cache the results per query instead of trying to make informed guesses about expected index performance. > Even with this distinction we would have to make some choices as given two > indices returning the same number of estimated nodes for the same query, (I > assume) the fastest should be chosen, but if two indices return two > different node number estimates (e.g. that's likely if you have two > different full text indices being able to handle the same query), which one > should be chosen and why? Unless there are other contributing factors (like preferring a synchronous index over an asynchronous one, or an explicit preference by a client), it shouldn't really matter much which one of equally costly indexes is being selected. BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, 2014-06-18 16:02 GMT+02:00 Jukka Zitting : > Hi, > > On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili > wrote: > > should we just return the number of estimated entries for the cost? > > Yes, that's what I think the contract should be. > ok, that's different from what Thomas suggests, right? Just entry estimates, no network roundtrips / asynchronous index penalties, etc. > > > My other concern on this point is that it's not granted, in my opinion, > > that the index returning less entries would be the faster. > > The key concern here isn't the performance of the index, but the > overall performance of the query. If an index returns n paths for a > given query, then the amount of work done by the query engine will be > O(n) as it needs to look up each path and double-check the constraints > on those nodes (or O(n log n) if it also needs to sort the results). > Since I don't expect any reasonable index to perform worse than O(n), > returning n as the cost estimate seems like the best thing to do. > ok, under such perspective the index is not returning a cost, but how many nodes it will provide to the engine, the cost of the query is then a function of the number of entries. At the moment node number estimates and performance of the index aspects seem kind of merged into the "getCost". Then we should probably decouple (at least) the concepts of: 1. how many nodes the index will return for this query (as an estimate) 2. how fast in retrieving the estimated nodes the index is Even with this distinction we would have to make some choices as given two indices returning the same number of estimated nodes for the same query, (I assume) the fastest should be chosen, but if two indices return two different node number estimates (e.g. that's likely if you have two different full text indices being able to handle the same query), which one should be chosen and why? > > On a related note, I tend to disagree with the additional cost factors > introduced in AdvancedQueryIndex/IndexPlan. In practically all cases > the index lookups will be asymptotically faster than looking up all > the paths returned by the index. true, but then again the same question comes to my mind: how we select the index to be used? Are we just interested in how many nodes it will return or how much time it takes to do that plays some role in selection? > Thus I don't think there is value in > trying to make detailed estimates about the cost of the index lookup. > Thanks to all, I will probably need some more time to digest all the thoughts and maybe I'll come back with more questions / ideas. Regards, Tommaso > > BR, > > Jukka Zitting >
Re: [DISCUSS] - QueryIndex selection
Hi, 2014-06-18 13:44 GMT+02:00 Thomas Mueller : > Hi, > > >>QueryIndex.getCost > > >my doubt is what > >this heuristic function to estimate the "traversed entries" should look > >like in general > > Relational databases typically know the number of entries in the index > (total indexed entries), plus the selectivity of a column. See also > http://www.akadia.com/services/ora_index_selectivity.html . That way it's > quite easy to estimate the number of returned values, for a given filter. > I think Lucene and Solr have a way to get those numbers, but I didn't look > into that yet. I guess the problem might be, the cost method needs to be > very fast, otherwise all the queries would be affected. Relational > databases typically have those index statistics fully in memory. > ok thanks for clarifying, yes we could get such numbers from Lucene and Solr. > > Some databases additionally have a histogram that describes the > distribution of the values. See also > http://www.dba-oracle.com/t_histograms.htm > > >, but that may be implementation specific > > The cost returned by an index should not be implementation specific, > otherwise the cost between different index implementations can't be > compared. > exactly, that's one of my main concerns when starting this discussion. I wonder then if we should abstract the getCost method and just ask the indexes to return the estimated entries and let the QueryEngine assign a cost to each of them. > > >, and how that > >estimate should be used, should we just return the number of estimated > >entries for the cost? So that cost is 1 when we estimate to return 1 > >entry, > >2 when we estimate to return 2 entries, 1000 when we estimate to return > >1000 entries? > > Generally yes, but if you keep everything in memory or know things are > cached, then you could return less (as documented). > > >My other concern on this point is that it's not granted, in my opinion, > >that the index returning less entries would be the faster. > > Yes, it's not that much about less entries or more entries, it's about > lower or higher cost. If there are more entries, but they are all in > memory, then that's better than less entries, but you need a disk read for > each entry. That's what the javadocs talk about. > yes, but I think the problem is that we don't have a defined "function" that implements that, so how much should we smoothen the cost if the entries are all in memory or viceversa increase the cost if there're network roundtrips? That's what it seems to me to be implementation specific and could lead to not comparable costs. > > >it would be ok for me to either deprecate it or improve the semantics of > >the cost calculation (e.g. explicitly introduce other metrics to be taken > >into account in the cost calculation: local / remote index, > > The QueryIndex will be deprated, sure. Until the AdvancedQueryIndex is > used, the documented behavior should be used: "...and if there could in > theory be one network roundtrip or disk read operation per node (this > method may return a lower number if the data is known to be fully in > memory)." > > >since the FullTextQueryIndex seems to be a marker interface, could we > >"merge" it some way into the AdvancedQueryIndex? Just to make things > >simpler. > > Yes. The IndexPlan of the AdvancedQueryIndex has a method > "isFulltextIndex". Once we can switch to the AdvancedQueryIndex, the > FullTextQueryIndex is no longer needed. > ok > > >In the end my very generic take on this discussion is that we should try > >to > >have a clearer API/documentation for the part that involves picking the > >index via cost / index plans (but maybe that's only me), implementations > >can be then adjusted accordingly if needed. > > Yes, sure. > thanks for your comments Thomas. Regards, Tommaso > > Regards, > Thomas > >
Re: [DISCUSS] - QueryIndex selection
ok, thanks Davide for the pointers. Regards, Tommaso 2014-06-18 13:36 GMT+02:00 Davide Giannella : > On 18/06/2014 10:26, Tommaso Teofili wrote: > > it would be ok for me to either deprecate it or improve the semantics > > of the cost calculation (e.g. explicitly introduce other metrics to be > > taken into account in the cost calculation: local / remote index, > With the IndexPlan.isDelayed() we instruct the query engine if the > current index is asynchronous or not > > > http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isDelayed%28%29 > > > since the FullTextQueryIndex seems to be a marker interface, could we > > "merge" it some way into the AdvancedQueryIndex? Just to make things > > simpler. > I think we already have this aspect in the IndexPlan.isFullText() > > > http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isFulltextIndex%28%29 > > Cheers > Davide > > >
Re: [DISCUSS] - QueryIndex selection
Hi, On Wed, Jun 18, 2014 at 7:44 AM, Thomas Mueller wrote: >>My other concern on this point is that it's not granted, in my opinion, >>that the index returning less entries would be the faster. > > Yes, it's not that much about less entries or more entries, it's about > lower or higher cost. If there are more entries, but they are all in > memory, then that's better than less entries, but you need a disk read for > each entry. That's what the javadocs talk about. There's no way for a query index to know whether a particular node is cached in memory or not. BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, On Wed, Jun 18, 2014 at 4:26 AM, Tommaso Teofili wrote: > should we just return the number of estimated entries for the cost? Yes, that's what I think the contract should be. > My other concern on this point is that it's not granted, in my opinion, > that the index returning less entries would be the faster. The key concern here isn't the performance of the index, but the overall performance of the query. If an index returns n paths for a given query, then the amount of work done by the query engine will be O(n) as it needs to look up each path and double-check the constraints on those nodes (or O(n log n) if it also needs to sort the results). Since I don't expect any reasonable index to perform worse than O(n), returning n as the cost estimate seems like the best thing to do. On a related note, I tend to disagree with the additional cost factors introduced in AdvancedQueryIndex/IndexPlan. In practically all cases the index lookups will be asymptotically faster than looking up all the paths returned by the index. Thus I don't think there is value in trying to make detailed estimates about the cost of the index lookup. BR, Jukka Zitting
Re: [DISCUSS] - QueryIndex selection
Hi, >>QueryIndex.getCost >my doubt is what >this heuristic function to estimate the "traversed entries" should look >like in general Relational databases typically know the number of entries in the index (total indexed entries), plus the selectivity of a column. See also http://www.akadia.com/services/ora_index_selectivity.html . That way it's quite easy to estimate the number of returned values, for a given filter. I think Lucene and Solr have a way to get those numbers, but I didn't look into that yet. I guess the problem might be, the cost method needs to be very fast, otherwise all the queries would be affected. Relational databases typically have those index statistics fully in memory. Some databases additionally have a histogram that describes the distribution of the values. See also http://www.dba-oracle.com/t_histograms.htm >, but that may be implementation specific The cost returned by an index should not be implementation specific, otherwise the cost between different index implementations can't be compared. >, and how that >estimate should be used, should we just return the number of estimated >entries for the cost? So that cost is 1 when we estimate to return 1 >entry, >2 when we estimate to return 2 entries, 1000 when we estimate to return >1000 entries? Generally yes, but if you keep everything in memory or know things are cached, then you could return less (as documented). >My other concern on this point is that it's not granted, in my opinion, >that the index returning less entries would be the faster. Yes, it's not that much about less entries or more entries, it's about lower or higher cost. If there are more entries, but they are all in memory, then that's better than less entries, but you need a disk read for each entry. That's what the javadocs talk about. >it would be ok for me to either deprecate it or improve the semantics of >the cost calculation (e.g. explicitly introduce other metrics to be taken >into account in the cost calculation: local / remote index, The QueryIndex will be deprated, sure. Until the AdvancedQueryIndex is used, the documented behavior should be used: "...and if there could in theory be one network roundtrip or disk read operation per node (this method may return a lower number if the data is known to be fully in memory)." >since the FullTextQueryIndex seems to be a marker interface, could we >"merge" it some way into the AdvancedQueryIndex? Just to make things >simpler. Yes. The IndexPlan of the AdvancedQueryIndex has a method "isFulltextIndex". Once we can switch to the AdvancedQueryIndex, the FullTextQueryIndex is no longer needed. >In the end my very generic take on this discussion is that we should try >to >have a clearer API/documentation for the part that involves picking the >index via cost / index plans (but maybe that's only me), implementations >can be then adjusted accordingly if needed. Yes, sure. Regards, Thomas
Re: [DISCUSS] - QueryIndex selection
On 18/06/2014 10:26, Tommaso Teofili wrote: > it would be ok for me to either deprecate it or improve the semantics > of the cost calculation (e.g. explicitly introduce other metrics to be > taken into account in the cost calculation: local / remote index, With the IndexPlan.isDelayed() we instruct the query engine if the current index is asynchronous or not http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isDelayed%28%29 > since the FullTextQueryIndex seems to be a marker interface, could we > "merge" it some way into the AdvancedQueryIndex? Just to make things > simpler. I think we already have this aspect in the IndexPlan.isFullText() http://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/query/QueryIndex.IndexPlan.html#isFulltextIndex%28%29 Cheers Davide
Re: [DISCUSS] - QueryIndex selection
2014-06-04 9:36 GMT+02:00 Thomas Mueller : > Hi, > > QueryIndex.getCost: this is actually quite well documented (see the > Javadocs). But the implementations might not fully follow the contract :-) > this is probably just my opinion but the contract is not much clear; to me finding "the worst-case cost to query with the given filter" defines what should be calculated and "the returned cost is a value between 1 (very fast; lookup of a unique node) and the estimated number of entries to traverse" defines the output range as a function of the number of the, estimated, number of results that the query would eventually return using that index but, regardless of existing implementations, my doubt is what this heuristic function to estimate the "traversed entries" should look like in general, but that may be implementation specific, and how that estimate should be used, should we just return the number of estimated entries for the cost? So that cost is 1 when we estimate to return 1 entry, 2 when we estimate to return 2 entries, 1000 when we estimate to return 1000 entries? My other concern on this point is that it's not granted, in my opinion, that the index returning less entries would be the faster. > But anyway, I think it's anyway the better to deprecate it and use > AdvancedQueryIndex, as it has more features (specially important for > ordered indexes). it would be ok for me to either deprecate it or improve the semantics of the cost calculation (e.g. explicitly introduce other metrics to be taken into account in the cost calculation: local / remote index, > Currently, both QueryIndex and AdvancedQueryIndex are > supported, in the future I hope we can switch all index implementations to > AdvancedQueryIndex. I didn't want to do that just before the 1.0 release > however. > right, good point. > > > get rid for example of the FullTextQueryIndex interface > > The FullTextQueryIndex allows to chose a full-text index for full-text > queries, if one is available, even if the cost is higher. The problem is > that only full-text indexes can return the correctly data, if full-text > constraints are used. If we want to get rid of the FullTextQueryIndex > interface, we need to address this problem in some other way. > since the FullTextQueryIndex seems to be a marker interface, could we "merge" it some way into the AdvancedQueryIndex? Just to make things simpler. > > > should we always select the fastest index ? Especially for full text > >ones this should be in some way configurable. > > There are multiple aspects to this: (a) "let the user decide which index > to use", and (b) "synchronous versus async indexes": > > (a) The user should be able to decide which index to use for a certain > query. There are some problems with that: The index the user has in mind > might not be available (in a certain environment, or with a later version > of Oak, because for example the index implementation was replaced, or when > not using Oak). right, in this case we can fallback to the current mechanism where the query engine decides which index to use > Hardcoding the index to use (in the query) is problematic. > Relational databases: Oracle supports "hints" (search for "oracle database > hint"). SQLite supports something similar, but there it's actually an > assertion that an index is available, not a hint: > http://www.sqlite.org/lang_indexedby.html . My position is that we should > avoid such a mechanism, and instead improve the query engine, the indexes > implementations, and the documentation instead. > +1, I would not want to specify the index to be used in the (JCR) query, the approach I was thinking too was to let the user to be able to give an order of preference of the index to be used (at a repository level) which would be respected in case those indexes exist and again that would fallback to current selection mechanism in case of missing index(es). > > (b) Synchronous versus async indexes: some indexes are updated > asynchronously, and therefore will not include the very latest additions > to the content. (Recently removed nodes are not a problem, as the query > engine will anyway have to check if a node is available). We could let the > user decide if using an asynchronous index is OK or not. > > For both (a) and (b), one problem is that the JCR spec doesn't allow for > extensions in the query syntax, so if the user would use "select ... > option async_ok", the query would not work with Jackrabbit 2.x and other > JCR implementations. Maybe we should create a JCR commons utility method, > so that one could use: > > QueryManager qm = ... > > Query q = JcrUtils.createQuery(qm, query, language, ASYNC_OK); > > > The method JcrUtils.createQuery could then use "instanceof" to decide > whether it's OK to modify the query (for Oak) or not (non-Oak). > yes, that's an option but if we could avoid having it in the query that would be better. > > > for full text queries for example, one may be inter
Re: [DISCUSS] - QueryIndex selection
>> >>We could let the >> user decide if using an asynchronous index is OK or not. > >Another option is if there is no synch index available but an asynch >index is available then QueryEngine should use that instead of >resorting to traversal. Well, this is the current behavior. The query engine doesn't know which index is async. If an async index reports a reasonable cost, the it is already used. But this might not be what the user wants: Maybe having up-to-date results is important for him. So, maybe instead of "async_ok", we need a flag "async_not_ok". Regards, Thomas
Re: [DISCUSS] - QueryIndex selection
On Wed, Jun 4, 2014 at 1:06 PM, Thomas Mueller wrote: > We could let the > user decide if using an asynchronous index is OK or not. Another option is if there is no synch index available but an asynch index is available then QueryEngine should use that instead of resorting to traversal. This should enable us to support existing codebase where people have not created index for all possible cases. Chetan Mehrotra
Re: [DISCUSS] - QueryIndex selection
Hi, QueryIndex.getCost: this is actually quite well documented (see the Javadocs). But the implementations might not fully follow the contract :-) But anyway, I think it's anyway the better to deprecate it and use AdvancedQueryIndex, as it has more features (specially important for ordered indexes). Currently, both QueryIndex and AdvancedQueryIndex are supported, in the future I hope we can switch all index implementations to AdvancedQueryIndex. I didn't want to do that just before the 1.0 release however. > get rid for example of the FullTextQueryIndex interface The FullTextQueryIndex allows to chose a full-text index for full-text queries, if one is available, even if the cost is higher. The problem is that only full-text indexes can return the correctly data, if full-text constraints are used. If we want to get rid of the FullTextQueryIndex interface, we need to address this problem in some other way. > should we always select the fastest index ? Especially for full text >ones this should be in some way configurable. There are multiple aspects to this: (a) "let the user decide which index to use", and (b) "synchronous versus async indexes": (a) The user should be able to decide which index to use for a certain query. There are some problems with that: The index the user has in mind might not be available (in a certain environment, or with a later version of Oak, because for example the index implementation was replaced, or when not using Oak). Hardcoding the index to use (in the query) is problematic. Relational databases: Oracle supports "hints" (search for "oracle database hint"). SQLite supports something similar, but there it's actually an assertion that an index is available, not a hint: http://www.sqlite.org/lang_indexedby.html . My position is that we should avoid such a mechanism, and instead improve the query engine, the indexes implementations, and the documentation instead. (b) Synchronous versus async indexes: some indexes are updated asynchronously, and therefore will not include the very latest additions to the content. (Recently removed nodes are not a problem, as the query engine will anyway have to check if a node is available). We could let the user decide if using an asynchronous index is OK or not. For both (a) and (b), one problem is that the JCR spec doesn't allow for extensions in the query syntax, so if the user would use "select ... option async_ok", the query would not work with Jackrabbit 2.x and other JCR implementations. Maybe we should create a JCR commons utility method, so that one could use: QueryManager qm = ... Query q = JcrUtils.createQuery(qm, query, language, ASYNC_OK); The method JcrUtils.createQuery could then use "instanceof" to decide whether it's OK to modify the query (for Oak) or not (non-Oak). > for full text queries for example, one may be interested in having a >higher recall (more documents matching the query) which may eventually >lead to a slightly slower query execution / higher cost evaluation That would also need to be specified by the user in some way, right? For example in the query itself? We could use a similar mechanism than ASYNC_OK above, so the application would still work for Jackrabbit 2.x. Regards, Thomas On 26/05/14 10:25, "Tommaso Teofili" wrote: >Hi all, > >I'd like to start discussing how we may improve / simplify current way of >selecting a query engine to use for a certain query. > >In the QueryIndex interface we have the plain old getCost method which >selects the index returning the lower cost for the given query but, >recently, also an AdvancedQueryIndex interface has been introduced which, >if I understood things correctly, uses the IndexPlan(s) returned by each >query index for the given query to select which one has to be used. >So I would like to discuss if it's possible to clean up things a bit in >order to have a unified query selection mechanism. > >At the moment, in my opinion, one problem with the getCost() method is >that >it inherently merges the following topics: >- index capability to handle a certain query (can the QueryIndex handle >that query?) >- index efficiency in handling a certain query (how fast will the >QueryIndex will be in handling that query?) > >Also the efficiency is not evaluated on a "cost model", each QueryIndex >implementation can return an arbitrary different number; on one hand this >is ok as it allows to take very index specific constraint into account: on >the other hand if one has to write a new QueryIndex implementation he/she >will have to look into each other query index implementation to understand >(and design) if / when its index is picked up; and even with already >existing indexes it's not easy to say upfront which one will be selected >(e.g. for debugging purposes). > >With the AdvancedQueryIndex, if I understood it correctly (I just had a >look at it on Friday), a QueryIndex is selected upon its IndexPlan, which >is supposed to address better both th
Re: [DISCUSS] - QueryIndex selection
2014-05-27 11:21 GMT+02:00 Davide Giannella : > On 26/05/2014 09:25, Tommaso Teofili wrote: > > ... > > Also the efficiency is not evaluated on a "cost model", each QueryIndex > > implementation can return an arbitrary different number; on one hand this > > is ok as it allows to take very index specific constraint into account: > on > > the other hand if one has to write a new QueryIndex implementation he/she > > will have to look into each other query index implementation to > understand > > (and design) if / when its index is picked up; and even with already > > existing indexes it's not easy to say upfront which one will be selected > > (e.g. for debugging purposes). > > I don't know if by using the IndexPlans it will be possible to says > beforehand which index will be pick up. It really depends by the query > engine logics and if there are other indexes that could perform faster > why not choose them? > for full text queries for example, one may be interested in having a higher recall (more documents matching the query) which may eventually lead to a slightly slower query execution / higher cost evaluation, then if we only select the fastest that use case cannot be addressed. > > > With the AdvancedQueryIndex, if I understood it correctly (I just had a > > look at it on Friday), a QueryIndex is selected upon its IndexPlan, which > > is supposed to address better both the cost (as it explicitly exposes the > > cost per execution, cost per entry and estimated entry count metrics) and > > the query index capability to handle a certain query (e.g. this is used > for > > ordered property index). > > However, at the moment, only the OrderedPropertyIndex is using it so I > > think it'd be good to decide if we want to go further with the > > AdvancedQueryIndex also for the other QueryIndex implementations (and get > > rid for example of the FullTextQueryIndex interface as it seems useless > to > > me) or not. > > > > One final question on query index selection, should we always select the > > fastest index ? > > Especially for full text ones this should be in some way configurable. > > Yes we discussed off-line last time. It seems that it would be good for > the query engine to expose an API in which the client application could > state: run with this index. Something like > queryengine.forceIndex("path/to/oak:QueryIndexDefinition"). If > specified the query engine will know about it and skip the index > selection by forcing it. > > Useful definitely for debugging as well as in edge cases where the > client application will know that the query has to be run with a > specific index for a reason or another. > > ... > > As discussed also offline last week with some other folks maybe one > further > > metric to be taken into consideration for the index selection is if the > > index is synchronous or not > > > The current index plan of the AdvancedQueryIndex expose the sync/async > aspect by IndexPlan.isDelayed(). If that is taken into account yet I > don't know :) > ok > > Another aspect of improvement we were discussing is on the query engine > side. As the index selections and plan are expensive if the query engine > is asked to execute a query with bindings[0] it could cache the selected > index for either a fixed amount of time or other logics. > > (0) http://goo.gl/LDQw1Y > > Last one as we where discussing about the possibility of serving queries > for "all" the properties from Lucene/Solr a metrics of evaluation could > be that in case of the same property served by a sync property index and > a Lucene, the first one will have should be chosen as it would be local. > right. Thanks, Tommaso > > D. > > >
Re: [DISCUSS] - QueryIndex selection
On 26/05/2014 09:25, Tommaso Teofili wrote: > ... > Also the efficiency is not evaluated on a "cost model", each QueryIndex > implementation can return an arbitrary different number; on one hand this > is ok as it allows to take very index specific constraint into account: on > the other hand if one has to write a new QueryIndex implementation he/she > will have to look into each other query index implementation to understand > (and design) if / when its index is picked up; and even with already > existing indexes it's not easy to say upfront which one will be selected > (e.g. for debugging purposes). I don't know if by using the IndexPlans it will be possible to says beforehand which index will be pick up. It really depends by the query engine logics and if there are other indexes that could perform faster why not choose them? > With the AdvancedQueryIndex, if I understood it correctly (I just had a > look at it on Friday), a QueryIndex is selected upon its IndexPlan, which > is supposed to address better both the cost (as it explicitly exposes the > cost per execution, cost per entry and estimated entry count metrics) and > the query index capability to handle a certain query (e.g. this is used for > ordered property index). > However, at the moment, only the OrderedPropertyIndex is using it so I > think it'd be good to decide if we want to go further with the > AdvancedQueryIndex also for the other QueryIndex implementations (and get > rid for example of the FullTextQueryIndex interface as it seems useless to > me) or not. > > One final question on query index selection, should we always select the > fastest index ? > Especially for full text ones this should be in some way configurable. Yes we discussed off-line last time. It seems that it would be good for the query engine to expose an API in which the client application could state: run with this index. Something like queryengine.forceIndex("path/to/oak:QueryIndexDefinition"). If specified the query engine will know about it and skip the index selection by forcing it. Useful definitely for debugging as well as in edge cases where the client application will know that the query has to be run with a specific index for a reason or another. > ... > As discussed also offline last week with some other folks maybe one further > metric to be taken into consideration for the index selection is if the > index is synchronous or not > The current index plan of the AdvancedQueryIndex expose the sync/async aspect by IndexPlan.isDelayed(). If that is taken into account yet I don't know :) Another aspect of improvement we were discussing is on the query engine side. As the index selections and plan are expensive if the query engine is asked to execute a query with bindings[0] it could cache the selected index for either a fixed amount of time or other logics. (0) http://goo.gl/LDQw1Y Last one as we where discussing about the possibility of serving queries for "all" the properties from Lucene/Solr a metrics of evaluation could be that in case of the same property served by a sync property index and a Lucene, the first one will have should be chosen as it would be local. D.