Re: Name index

2014-06-23 Thread Davide Giannella
On 20/06/2014 18:06, Jukka Zitting wrote:
 Hi,

 Here's an idea for an index structure (for now somewhat specific to
 SegmentMK) for speeding up node name and property existence queries.
 ...

 INDEX UPDATES

 The index would be maintained by a normal index editor that for each
 added/removed items would update the respective name counts at each
 level of the index up to the root. These updates can be consolidated
 at each level of the index to keep the number of writes down to a
 minimum. The cost of updating the index would be O(d * k) where d is
 the depth of the changes in a commit and k the number of added/removed
 items. The cost is similar to that of the property index where k is
 the number of added/removed values. Changing the value of an existing
 property would trigger no updates to this index.

 Since the index updates work their way recursively up to the root of
 the tree, there is an inherent synchronization bottleneck in updating
 the count values higher up the tree. This (and the way storage is
 optimized) makes this index structure well suited for the SegmentMK,
 but presents a problem for the DocMK. In there a better way to
 implement a similar index might be to leverage the underlying indexing
 capabilities of MongoDB or whichever database is used under the hood.
 Alternatively it might be possible to adjust the index structure to
 avoid this problem, though for now I don't see any good way of doing
 so.
First of all I think it could work out well :)

What concerns me most is the update part. AFAIU doing a node count is
not that cheap, so I guess you were thinking of something like
getCount(MAX): if the count == MAX, do some estimation of what
could be there beyond the MAX limit.

The other bit is that it is very Segment-specific. As you already
highlighted, we should come up with something different based on the
underlying persistence. It's not nice from a code point of view, as it
will complicate things, but I don't see any other way either as of
now. We could make it so that if the index is not there, we fall back on
traversing as usual.

Last point is how/if we want the end user to configure it. Within the
repository, as a standard property index? I'm asking because it will have
an impact on repository size and on update performance, and maybe
someone would like to say something like: I know property foo won't be
anywhere other than /a/b, and elsewhere I don't care. I would like to have
only /a/b updated for foo.

Cheers
Davide


Re: [DISCUSS] - QueryIndex selection

2014-06-23 Thread Thomas Mueller
Hi,

should we just return the number of estimated entries for the cost?

For Lucene, the property index, the ordered index, and the node type
index: yes. 

For Solr, the cost per index lookup (not per entry) is probably a bit
higher, because there is a network round trip, especially if Solr is
remote. That's a fixed offset, so the cost could be: 1 +
estimatedEntryCount.

For in-memory indexes (that keep all entries in memory), no. One example
is an in-memory index of transient UUIDs. In this case it might depend on
whether the UUID is in memory or not. Calculating the cost in this case is
quick, as it's just an in-memory lookup. Therefore, the cost could be very
close to 0.
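The per-index heuristics above could be sketched roughly as follows (illustrative only, not Oak API; the index-type names and the 0.001 in-memory figure are invented for the example):

```python
def cost(index_type, estimated_entry_count):
    """Rough cost estimate for an index lookup, per the discussion above."""
    if index_type in ("lucene", "property", "ordered", "nodetype"):
        # Cost is simply the estimated number of entries.
        return estimated_entry_count
    if index_type == "solr":
        # Fixed offset for the (possibly remote) network round trip.
        return 1 + estimated_entry_count
    if index_type == "memory":
        # Pure in-memory lookup: the cost is close to 0.
        return 0.001
    raise ValueError("unknown index type: " + index_type)

print(cost("property", 250))
print(cost("solr", 250))
```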

Right. I don't believe the cost of the index lookup is significant (at
least in the asymptotic sense) compared to the overall cost of
executing a query.

Sorry, I don't understand. The cost of the index lookup *is* significant
of course, especially if the nodes are not in the cache (which is very
common). In that case, the index lookup can contribute about 50% of the
overall cost of a query (if, let's assume, one disk read is needed for the
index lookup, and one disk read is needed to read the actual node). For a
covering index - 
http://stackoverflow.com/questions/62137/what-is-a-covered-index - the
index lookup is nearly 100% of the query cost.

 in the asymptotic sense

Sorry could you translate that for me please? :-)

ok, under such a perspective the index is not returning a cost, but how many
 nodes it will provide to the engine; the cost of the query is then a
 function of the number of entries.

Exactly.

As I wrote above, it depends on the nature of the index (in-memory, disk
based, remote or local).

instead of trying to make informed guesses about expected index
performance.

Trying to make an informed guess about the cost is the whole point of a
cost based optimizer - http://en.wikipedia.org/wiki/Query_optimization

it shouldn't really matter much which one of equally
costly indexes is being selected.

Sure, if the cost of two indexes is the same, then it doesn't matter which
index is used. That's the reason to make sure the returned cost is
somewhat accurate, and the reason to use the same scale (units of
measure) in all index implementations.

Regards,
Thomas



Re: Name index

2014-06-23 Thread Thomas Mueller
Hi,

What if a node contains millions of (direct) child nodes, how would one do
an efficient lookup?

We have quite a lot of (property) indexes; what would be the storage overhead?
(I think it would be quite significant with about 100 property indexes.)

 for speeding up node name and property existence queries.


How could this index structure speed up node name queries? Would the root
node know all the names of all nodes in the repository?

What would be very, very, very valuable (to better estimate the cost of an
index or traversal) is to know the number of descendant nodes of a node;
see also OAK-894.

Regards,
Thomas



On 20/06/14 18:06, Jukka Zitting jukka.zitt...@gmail.com wrote:

Hi,

Here's an idea for an index structure (for now somewhat specific to
SegmentMK) for speeding up node name and property existence queries.
The index could also be used to avoid full repository traversals when
processing queries with unindexed property constraints. In some cases
it might also work pretty well for queries with multi-property
constraints for which we currently don't have very good support.

INDEX STRUCTURE

The basic idea is to keep track of how many times each node or
property name occurs within each subtree. The index structure would
look something like this:

{ foo: { a: 238, b: 13 },
  bar: { a: 172, c: 86 } }

Such an index would for example tell that there are 238 nodes named
foo or with a property named foo within the /a subtree, and 86
nodes with bar within the /c subtree.

The index structure would be repeated for each subtree that contains
more than two nodes. For example the index structure for /a could
be:

{ foo: { d: 137, e: 100 },
  bar: { e: 100, f: 72 } }

So we'd have 137 nodes with foo in the /a/d subtree and 100 in the
/a/e subtree. Based on the index we can also tell that there are no
bar nodes or properties within the /a/d subtree.
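The two levels above can be modeled as plain nested maps (name -> direct child -> count); this is only an illustration of the structure, not how SegmentMK would store it:

```python
# Per-subtree name counts for the repository root and for /a.
root_index = {
    "foo": {"a": 238, "b": 13},
    "bar": {"a": 172, "c": 86},
}
a_index = {
    "foo": {"d": 137, "e": 100},
    "bar": {"e": 100, "f": 72},
}

# 238 nodes with "foo" under /a, and no "bar" at all under /a/d:
assert root_index["foo"]["a"] == 238
assert "d" not in a_index["bar"]
```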

COST CALCULATION

To calculate the cost of a query like //[@foo='...'], you'd look at
the index structure and add together the subtree counts for that name.
In this case the cost estimate would be:

238 + 13 = 251

If multiple names are referenced in a query, like //[@foo='...' and
@bar='...'], then you'd take the counts for both names and add
together the minimum value of each subtree. In this case the cost
estimate would be:

min(238, 172) + min(13, 0) + min(0, 86) = 172

If a query has a path restriction, like in
/jcr:root/a//[@foo='...'], we can use the index structure within
that path for a more specific cost estimate.
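The two cost formulas can be written out directly (a sketch over the map structure from the examples, not Oak code):

```python
def single_name_cost(index, name):
    # Cost of //*[@foo='...']: sum the subtree counts for the name.
    return sum(index.get(name, {}).values())

def multi_name_cost(index, names):
    # Cost of //*[@foo='...' and @bar='...']: per subtree, take the
    # minimum count across all names, then sum over the subtrees.
    subtrees = set()
    for name in names:
        subtrees.update(index.get(name, {}))
    return sum(
        min(index.get(name, {}).get(s, 0) for name in names)
        for s in subtrees
    )

index = {"foo": {"a": 238, "b": 13}, "bar": {"a": 172, "c": 86}}
print(single_name_cost(index, "foo"))          # 238 + 13 = 251
print(multi_name_cost(index, ["foo", "bar"]))  # min(238,172) + min(13,0) + min(0,86) = 172
```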

INDEX LOOKUP

Index lookup would be implemented as a tree traversal that's guided by
the subtree counts stored in the index.

When looking up a query like //[@foo='...'], you'd first look a the
top-level index structure to find that the only nodes with foo items
are within the /a and /b subtrees, and would then descend to those
subtrees. In /a you'd find that there are more foo items within
/a/d and /a/e so that's where the traversal would continue. Also,
since 137 + 100 < 238, you'd report the /a path as a potential match
for the query. Finally, when encountering a leaf node for which no
explicit index data is present, you'd do a hasProperty(foo) check to
determine whether that path should be reported as a potential match.

And if multiple names are referenced in a query, like //[@foo='...'
and @bar='...'], you'd use the minimum of the matching subtree counts
to direct the tree traversal. Since by looking at the index we can
tell that neither the /b nor the /c subtree contains both foo
and bar items, the traversal can be limited to just the /a
subtree, and in there to the /a/e subtree that is the only place
where both foo and bar can be found.

If a query has a path restriction, the guided traversal can start
directly at that path.
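The guided traversal described above could look roughly like this (a sketch with invented names; 'has_item' stands in for the hasProperty()/child-name check at leaf nodes, and 'expected' is the count the parent's index entry promised for this subtree):

```python
def lookup(path, name, expected, indexes, has_item):
    """Return paths that potentially match //*[@name='...'] under 'path'."""
    index = indexes.get(path)
    if index is None:
        # Leaf without explicit index data: do a hasProperty()-style check.
        return [path] if has_item(path, name) else []
    counts = index.get(name, {})
    matches = []
    # If the children account for fewer items than expected, the node
    # itself holds one (e.g. 137 + 100 < 238 reports /a as a match).
    if sum(counts.values()) < expected:
        matches.append(path)
    for child, count in counts.items():
        child_path = path + "/" + child if path != "/" else "/" + child
        matches += lookup(child_path, name, count, indexes, has_item)
    return matches

indexes = {
    "/":  {"foo": {"a": 238, "b": 13}, "bar": {"a": 172, "c": 86}},
    "/a": {"foo": {"d": 137, "e": 100}, "bar": {"e": 100, "f": 72}},
}
def has_item(path, name):
    return True  # pretend every checked leaf really has the item

print(lookup("/", "foo", 251, indexes, has_item))
```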

INDEX UPDATES

The index would be maintained by a normal index editor that for each
added/removed items would update the respective name counts at each
level of the index up to the root. These updates can be consolidated
at each level of the index to keep the number of writes down to a
minimum. The cost of updating the index would be O(d * k) where d is
the depth of the changes in a commit and k the number of added/removed
items. The cost is similar to that of the property index where k is
the number of added/removed values. Changing the value of an existing
property would trigger no updates to this index.

Since the index updates work their way recursively up to the root of
the tree, there is an inherent synchronization bottleneck in updating
the count values higher up the tree. This (and the way storage is
optimized) makes this index structure well suited for the SegmentMK,
but presents a problem for the DocMK. In there a better way to
implement a similar index might be to leverage the underlying indexing
capabilities of MongoDB or whichever database is used under the hood.
Alternatively it might be possible to adjust the index structure to
avoid this problem, though for now I don't see any good way of doing so.

Re: Adding a timer in commons

2014-06-23 Thread Michael Dürig


+1 in general. However,

- although it results in nice code on the client side, I'm a bit 
reluctant about putting all the code into the instance initialiser.


- how about reusing org.apache.jackrabbit.oak.stats.Clock instead of 
using Guava's Stopwatch? If necessary we could still implement Clock 
based on Stopwatch.


- Timer might not be the best name for the class. It should probably 
be something with Log in its name


Michael

On 21.6.14 6:41, Davide Giannella wrote:

Hello team,

in the warm laziness of a summer Saturday I was thinking about a way of
having in Oak a timing facility for logging/benchmarking critical parts
of the code. Something where we could set the logging to DEBUG and
start seeing the information.

I started writing this quick sample

https://gist.github.com/davidegiannella/107f1f6bd16058020b64

any feedback? By using slf4j we would be able to replace the
System.out with a LOG.debug and most of the magic would be done.
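The gist isn't reproduced here, so this is only a guess at the general idea (a scoped timer that logs elapsed time at DEBUG so the information "just shows up" when the level is lowered); all names are invented:

```python
import logging
import time

log = logging.getLogger("oak.timing")

class Timed:
    """Time a critical section and report it at DEBUG level."""
    def __init__(self, label):
        self.label = label
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    def __exit__(self, *exc):
        elapsed_ms = (time.perf_counter() - self.start) * 1000
        # With logging set to DEBUG, the timing information starts showing up.
        log.debug("%s took %.2f ms", self.label, elapsed_ms)

with Timed("query execution"):
    sum(range(1000))  # stand-in for the timed critical section
```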

I don't know what the actual overhead of the class creation would be, but
I like the layout within the code.

Any feedback?

Cheers
Davide




Re: Name index

2014-06-23 Thread Jukka Zitting
Hi,

On Mon, Jun 23, 2014 at 3:07 AM, Davide Giannella
giannella.dav...@gmail.com wrote:
 What concerns me most is the update part. AFAIU doing a node count is
 not that cheap, so I guess you were thinking of something like
 getCount(MAX): if the count == MAX, do some estimation of what
 could be there beyond the MAX limit.

Since the counts are already included in the index structure, all we
need to do is iterate over the properties of the matching index node
and sum up the values. If that turns out to be too costly, we can even
keep a pre-calculated total count in an extra property.

 The other bit is that it is very Segment-specific.

Right. The proposed index is designed to leverage some of the benefits
of the SegmentMK model. The DocMK has different tradeoffs and thus can
better accommodate a differently organized index.

BTW, the same applies also to the property index. The default content
mirroring strategy used by the property index is actually unnecessary
overhead on the SegmentMK, where using a multi-valued property per
each index entry would be more efficient and would also easily give
more exact cost estimates as the number of matching paths would be
directly available.

 Last point is how/if we want the end user to configure it. Within the
 repository as a standard property index?

I have two alternatives in mind:

1) Using a normal query index definition in /oak:index, with a custom
index type like name.

2) Making the SegmentMK itself maintain such statistics in a hidden
subtree like /:segmentmk/names. There would be no need to explicitly
configure the index, and on repositories where that subtree is present
a matching QueryIndex implementation would automatically use it to
speed up affected queries.

I kind of like the latter approach, as it's conceptually similar to
the idea of using something like a MongoDB index to speed up certain
queries when using MongoMK.

 I'm saying because it will have an impact on repository size and performance
 for the update, and maybe someone would like to say something like: I know
 property foo won't be anywhere other than /a/b, and elsewhere I don't care. I
 would like to have only /a/b updated for foo.

As outlined in the proposal, the extra cost of updating this index is
actually pretty small. And the way the SegmentMK avoids duplicate
storage of names and values makes the size overhead pretty low.

So a non-configurable alternative like 2 might be just fine, or in
alternative 1 it would be possible to explicitly configure some
exclude rules like I don't care about the 'foo' name, nor the '/bar'
subtree.

BR,

Jukka Zitting


Re: Name index

2014-06-23 Thread Jukka Zitting
Hi,

On Mon, Jun 23, 2014 at 3:46 AM, Thomas Mueller muel...@adobe.com wrote:
 What if a node contains millions of (direct) child nodes, how would one do
 an efficient lookup?

The index structure contains the names of the matching subtrees, which
allows us to avoid iterating over all the child nodes in such cases.

For example, a part of the index structure for a node with 10k child
nodes named a might look something like this:

{ foo: { a2135: 238, a5382: 13 },
  bar: { a2135: 172, a9821: 86 } }

When looking up nodes with foo, you'd only need to consider the
a2135 and a5382 subtrees, and can ignore the remaining 9998
children.

 We have quite many (property) indexes, what would be the storage overhead?
 (I think it would be quite significant with about 100 property indexes.)

This would be a separate index, with nothing in common with the
property indexes.

Given the index structure and the way SegmentMK avoids storing
duplicate copies of repeating names, I expect the storage overhead to
be pretty low, < 1% of the repository size. Big-O calculation of the
storage overhead is a bit complicated, so I plan to do a simple
prototype to better estimate the actual overhead on a few real-world
repositories.

 How could this index structure speed up node name queries? Would the root
 node know all the names of all nodes in the repository?

The root node would know all the *distinct* names in the repository,
along with the subtrees that contain at least a single instance of
those names.

 What would be very, very, very valuable (to better estimate the cost of an
 index or traversal), is to know the number of descendent nodes of a node,
 see also OAK-894.

Checking this index for the count of jcr:primaryType would give a
pretty accurate estimate of the size of a subtree.

Alternatively, the index could easily be extended to also maintain
an exact subtree size count, for example by indexing a virtual :node
name for each node in the repository. Similarly a property count could
be implemented by indexing :property along with the normal name of
each property. Even further, we could pretty efficiently keep track of
the size of each subtree by indexing :size for the length() of each
value. With such extensions the index data for an nt:resource leaf
node could look like this:

{ jcr:primaryType: 1,
  jcr:uuid: 1,
  jcr:data: 1,
  jcr:lastModified: 1,
  jcr:mimeType: 1,
  :node: 1,
  :property: 5,
  :size: 15827 }

(Of course, to save storage space, such single-node index data would
not actually be stored in the repository, only aggregated to the index
entries higher up the tree.)
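Deriving such an extended entry for a single node is straightforward (a hypothetical sketch; ':size' here just sums string lengths of the values, which is only a stand-in for length()):

```python
def leaf_entry(properties):
    """Build the extended per-node index entry for one node.

    properties: dict of property name -> value.
    """
    entry = {name: 1 for name in properties}   # each name occurs once
    entry[":node"] = 1                         # one virtual entry per node
    entry[":property"] = len(properties)       # property count
    entry[":size"] = sum(len(str(v)) for v in properties.values())
    return entry

entry = leaf_entry({
    "jcr:primaryType": "nt:resource",
    "jcr:uuid": "...",
    "jcr:data": "...",
    "jcr:lastModified": "...",
    "jcr:mimeType": "...",
})
print(entry[":property"])  # 5
```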

BR,

Jukka Zitting


Re: [DISCUSS] - QueryIndex selection

2014-06-23 Thread Jukka Zitting
Hi,

On Mon, Jun 23, 2014 at 3:30 AM, Thomas Mueller muel...@adobe.com wrote:
Right. I don't believe the cost of the index lookup is significant (at
least in the asymptotic sense) compared to the overall cost of
executing a query.

 Sorry, I don't understand. The cost of the index lookup *is* significant
 of course, specially if the nodes are not in the cache (which is very
 common). In which case, index lookup can contribute to about 50% of the
 overall cost of a query (if, let's assume, one disk read is needed for the
 index lookup, and one disk read is needed to read the actual node).

The problem with that assumption is that typically a single disk read
to the index would return n paths, whereas loading those n nodes might
well take n more disk reads.

Similarly for the Solr case you mention: There is some overhead in the
network roundtrip, but the search server will typically reply with the
list of all matching paths in a single roundtrip, whereas then loading
those matching nodes will (at least with our current code) require
O(n) network roundtrips or disk reads.

 For a covering index -
 http://stackoverflow.com/questions/62137/what-is-a-covered-index - the
 index lookup is nearly 100% of the query cost.

Sure, but we don't use a covered index. At least in the current design
the query engine will always load all the matching nodes, regardless
of any extra information stored in the index. Thus we can't use the
performance of the index lookup as an accurate estimate of the overall
query performance.

 in the asymptotic sense

 Sorry could you translate that for me please? :-)

In the Big-O sense.

The overhead of the index lookup is probably significant when only few
matching paths are returned (the UUID index would be the ultimate
example), but in those cases query performance is probably best
optimized in other ways (caching, configuration, profiling, etc.) than
making a more accurate estimate of the index lookup performance,
especially since in most such cases there is only a single index with
exact matches.

When the number of matching paths becomes large, say in the 1k-1M
range, the relative performance overhead of different query indexes
becomes less of an issue. For example, say index A returns 10k
matching paths in 100ms and index B returns 20k less accurately
matching paths in just 10ms. If accessing each node takes 1ms on
average, the significantly faster performance of index B is totally
irrelevant to the overall time it takes to execute the full query.
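The arithmetic behind that example (timings taken from the text above, per-node access assumed at 1ms):

```python
def total_query_time_ms(lookup_ms, paths_returned, per_node_ms=1.0):
    # Overall query time = index lookup + loading every matching node.
    return lookup_ms + paths_returned * per_node_ms

a = total_query_time_ms(100, 10_000)  # index A: 100 + 10000 = 10100 ms
b = total_query_time_ms(10, 20_000)   # index B: 10 + 20000 = 20010 ms
# Index B's 10x faster lookup doesn't help: node access dominates,
# and B returns twice as many paths.
print(a, b)
```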

 Trying to make an informed guess about the cost is the whole point of a
 cost based optimizer - http://en.wikipedia.org/wiki/Query_optimization

Agreed, but we should be making guesses about the overall query
performance, not just the index lookup time.

BR,

Jukka Zitting


Re: [DISCUSS] - QueryIndex selection

2014-06-23 Thread Thomas Mueller
Hi,

The problem with that assumption is that typically a single disk read
to the index would return n paths, whereas loading those n nodes might
well take n more disk reads.

Ideally, the cost returned by the index would reflect that. For
single-property indexes (all property indexes are single-property right
now), the cost should be lower than for a multi-property index, because
one disk read operation would fetch more paths.

Sure, but we don't use a covered index.

Yes, we are not there yet. The node is currently loaded to check access
rights, but that's an implementation detail of the access control part, and
it's not needed for the admin user. If (when) this is changed, it depends on
the query: if the query only returns the path (for example) and the
indexed property, then the node does not need to be loaded.


 At least in the current design
the query engine will always load all the matching nodes, regardless
of any extra information stored in the index. Thus we can't use the
performance of the index lookup as an accurate estimate of the overall
query performance.

Yes, the query engine knows the cost overhead per entry; that's why the
AdvancedQueryIndex interface is a bit different (it does not just contain
the cost).

The overhead of the index lookup is probably significant when only few
matching paths are returned (the UUID index would be the ultimate
example), but in those cases query performance is probably best
optimized in other ways (caching, configuration, profiling, etc.) than
making a more accurate estimate of the index lookup performance,
especially since in most such cases there is only a single index with
exact matches.

We have queries that have multiple property constraints (for example size
= 'L' and color = 'red'). If there are two indexes, one for size and one
for color, it's important to know which one is faster. (If there is a
multi-property index, that one might be faster.)

Agreed, but we should be making guesses about the overall query
performance, not just the index lookup time.

Yes, but the index implementation doesn't know (can't know) the cost of
the query engine. The query engine needs to calculate its own cost.

Regards,
Thomas



Re: [DISCUSS] - QueryIndex selection

2014-06-23 Thread Jukka Zitting
Hi,

On Mon, Jun 23, 2014 at 11:18 AM, Thomas Mueller muel...@adobe.com wrote:
Sure, but we don't use a covered index.

 Yes, we are not there yet. The node is currently loaded to check access
 rights, but that's an implementation detail of access control part. And
 it's not needed for the admin. If (when) this is changed, it depends on
 the query. If the query only returns the path (for example), and the
 indexed property, then the node does not need to be loaded.

It's more than access control. The query engine needs to double-check
the constraints of the query for each matching path before passing
that node to the client (see the constraint.evaluate() call in [1]). I
don't see any easy way to avoid that step without major refactoring.

 At least in the current design
the query engine will always load all the matching nodes, regardless
of any extra information stored in the index. Thus we can't use the
performance of the index lookup as an accurate estimate of the overall
query performance.

 Yes, the query engine knows the cost overhead per entry, that's why the
 AdvancedQueryIndex interface is a bit different (does not just contain the
 cost).

Are there any potential indexes where the
AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now
used in [2]) should return a value that's different from 1?

The overhead of the index lookup is probably significant when only few
matching paths are returned (the UUID index would be the ultimate
example), but in those cases query performance is probably best
optimized in other ways (caching, configuration, profiling, etc.) than
making a more accurate estimate of the index lookup performance,
especially since in most such cases there is only a single index with
exact matches.

 We have queries that have multiple property constraints (for example size
 = 'L' and color = 'red'). If there are two indexes, one for size and one
 for color, it's important to know which one is faster. (If there is a
 multi-property index, that one might be faster.)

Say I have 10k nodes distributed uniformly across two sizes and ten
colors. Querying for size=L and color=red would return 5000 paths
when using the size index, or 1000 paths when using the color index (a
multi-property index would return only the 500 exact matches). If the
size index is really fast and returns those 5000 paths in just 10ms
whereas the color index takes 100ms to return the 1000 matching paths,
it'll still be much slower to use the size index for the query if
accessing a single node takes say 1ms on average.
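Checking the numbers in that example (timings and the 1ms per-node figure come from the text above):

```python
nodes, sizes, colors = 10_000, 2, 10
size_matches = nodes // sizes               # 5000 paths via the size index
color_matches = nodes // colors             # 1000 paths via the color index
exact_matches = nodes // (sizes * colors)   # 500 via a multi-property index
print(size_matches, color_matches, exact_matches)

per_node_ms = 1.0
via_size = 10 + size_matches * per_node_ms    # 5010 ms despite the fast lookup
via_color = 100 + color_matches * per_node_ms # 1100 ms
# The "slower" color index still wins, because fewer nodes get loaded.
print(via_size, via_color)
```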

The index-level entry cost estimates only become relevant when the
cost of returning a path is more than a fraction of the cost of
loading a node. I don't believe that's the case for any reasonable
index implementations.

Agreed, but we should be making guesses about the overall query
performance, not just the index lookup time.

 Yes, but the index implementation doesn't know (can't know) the cost of
 the query engine. The query engine needs to calculate its own cost.

Agreed. I'm just worried about potential confusion about what the
getCostPerEntry() method (as used in [2]) should return. The value is
currently only set in [3], but there the estimate seems to be based on
the relative performance of the *index lookup*, not the overall
performance of a query. I believe either [2] or [3] should be adjusted
to fix the cost calculation.

[1] 
https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/query/QueryImpl.java#L628
[2] 
https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/query/QueryImpl.java#L810
[3] 
https://github.com/apache/jackrabbit-oak/blob/jackrabbit-oak-1.0.0/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/property/OrderedPropertyIndex.java#L73

BR,

Jukka Zitting


Re: [DISCUSS] - QueryIndex selection

2014-06-23 Thread Thomas Mueller
Hi,

It's more than access control. The query engine needs to double-check
the constraints of the query for each matching path before passing
that node to the client (see the constraint.evaluate() call in [1]). I
don't see any easy way to avoid that step without major refactoring.

If there is no other constraint, then no additional checks are needed.

Are there any potential indexes where the
AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now
used in [2]) should return a value that's different from 1?

Yes, for example for an index that keeps all (relevant) entries in memory,
the cost should be close to zero.

Say I have 10k nodes distributed uniformly across two sizes and ten
colors. Querying for size=L and color=red would return 5000 paths
when using the size index, or 1000 paths when using the color index (a
multi-property index would return only the 500 exact matches). If the
size index is really fast and returns those 5000 paths in just 10ms
whereas the color index takes 100ms to return the 1000 matching paths,
it'll still be much slower to use the size index for the query if
accessing a single node takes say 1ms on average.

Yes. The AdvancedQueryIndex has separate methods so that the query engine
can better calculate the cost (getCostPerExecution, getCostPerEntry,
getEstimatedEntryCount). The query could contain a limit (let's say 100)
which should also be taken into account, and possibly an order by
restriction. Plus the query engine could take into account that typically
only the first 50 entries are read (optimize for fast first 50 entries -
see also 
http://stackoverflow.com/questions/1308946/should-i-use-query-hint-fast-number-rows-fastfirstrow ).
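One way the three AdvancedQueryIndex figures might combine, including a query limit (the combination formula is an assumption; only the method names come from the discussion above):

```python
def estimated_cost(cost_per_execution, cost_per_entry,
                   estimated_entry_count, limit=None):
    """Combine the per-index cost figures into one estimate."""
    entries = estimated_entry_count
    if limit is not None:
        # A query limit caps how many entries are actually read.
        entries = min(entries, limit)
    return cost_per_execution + cost_per_entry * entries

# With a limit of 100, a huge index no longer dominates the estimate:
print(estimated_cost(1.0, 1.0, 50_000, limit=100))  # 101.0
```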

The index-level entry cost estimates only become relevant when the
cost of returning a path is more than a fraction of the cost of
loading a node. I don't believe that's the case for any reasonable
index implementations.

It depends on whether (and when) the node needs to be loaded.

I'm just worried about potential confusion about what the
getCostPerEntry() method (as used in [2]) should return. The value is
currently only set in [3], but there the estimate seems to be based on
the relative performance of the *index lookup*, not the overall
performance of a query. I believe either [2] or [3] should be adjusted
to fix the cost calculation.

Yes, you are right. Currently the formula assumes that the query engine
doesn't load the node. That's not correct. I created OAK-1910 to track
this.

Regards,
Thomas



Re: [DISCUSS] - QueryIndex selection

2014-06-23 Thread Jukka Zitting
Hi,

On Mon, Jun 23, 2014 at 1:58 PM, Thomas Mueller muel...@adobe.com wrote:
It's more than access control. The query engine needs to double-check
the constraints of the query for each matching path before passing
that node to the client (see the constraint.evaluate() call in [1]). I
don't see any easy way to avoid that step without major refactoring.

 If there is no other constraint, then no additional checks are needed.

That's not correct. For example if I query for a value with more than
100 characters, the PropertyIndex may return paths that actually won't
match the query.

This requirement to double-check the results is also described in the
QueryIndex.query() contract: "An implementation should only filter the
result if it can do so easily and efficiently; the query engine will
verify the data again (in memory) and check for access rights."
Are there any potential indexes where the
AdvancedQueryIndex.getCostPerEntry() method (at least the way it's now
used in [2]) should return a value that's different from 1?

 Yes, for example an index that keep all (relevant) entries in memory, the
 cost should be close to zero.

Makes sense if we adapt the calculation as suggested in OAK-1910.

 Yes. The AdvancedQueryIndex has separate methods so that the query engine
 can better calculate the cost (getCostPerExecution, getCostPerEntry,
 getEstimatedEntryCount). The query could contain a limit (let's say 100)
 which should also be taken into account, and possibly an order by
 restriction. Plus the query engine could take into account that typically,
 only the first 50 entries are read (optimize for fast first 50 entries -
 see also
 http://stackoverflow.com/questions/1308946/should-i-use-query-hint-fast-number-rows-fastfirstrow ).

Agreed.

The index-level entry cost estimates only become relevant when the
cost of returning a path is more than a fraction of the cost of
loading a node. I don't believe that's the case for any reasonable
index implementations.

 It depends on whether (and when) the node needs to be loaded.

The node always needs to be loaded, see above.

I'm just worried about potential confusion about what the
getCostPerEntry() method (as used in [2]) should return. The value is
currently only set in [3], but there the estimate seems to be based on
the relative performance of the *index lookup*, not the overall
performance of a query. I believe either [2] or [3] should be adjusted
to fix the cost calculation.

 Yes, you are right. Currently the formula assumes that the query engine
 doesn't load the node. That's not correct. I created OAK-1910 to track
 this.

Thanks!

BR,

Jukka Zitting


Re: [DISCUSS] - QueryIndex selection

2014-06-23 Thread Thomas Mueller
Hi,

It's more than access control. The query engine needs to double-check
the constraints of the query for each matching path before passing
that node to the client (see the constraint.evaluate() call in [1]). I
don't see any easy way to avoid that step without major refactoring.

 If there is no other constraint, then no additional checks are needed.

That's not correct. For example if I query for a value with more than
100 characters, the PropertyIndex may return paths that actually won't
match the query.

Sorry, sure, the condition is verified again. But this might be an
in-memory operation. The index may return the property value for each
entry as part of running the query (QueryIndex -> Cursor -> IndexRow). I
think the index implementations don't do that currently, so in reality the
node is always loaded with the current version of Oak, that's true.
Javadoc of Cursor.next(): "The index should return a row with those
properties that are stored in the index itself, so that the query engine
doesn't have to load the whole row / node unnecessarily."

Regards,
Thomas