Re: Query Performance and Optimization

David Johnson Tue, 13 Mar 2007 11:51:39 -0800

DescendantSelfAxisQuery is now taking the most time in the profiling that I
have recently done.

From my earlier post:


Out of the Jackrabbit code,
DescendantSelfAxisQuery.DescendantSelfAxisScorer.next()
is now taking the most time while executing my query suite - 68% of the
time.  Within it, calls to
DescendantSelfAxisQuery.DescendantSelfAxisScorer.calculateSubHits() take
the majority of the time (basically all of it).  Within those, calls to
BooleanScorer2.score(HitCollector) - back in Lucene code - take the
majority of the time.  If more specific profiling data is desired, please
feel free to ask.  I can also share the profile data in the form of a
Netbeans profile snapshot.

Any magic patch that might address these performance bottlenecks?  At this
point it doesn't look like the range queries are the issue anymore.

--- end of earlier post ---

I can see how my suggested optimization could severely impact some use
cases.  Nevertheless, "our use case" :-) is mostly querying a stable
hierarchy structure - i.e., we rarely, if ever, would move a tree with even
1000s of sub-nodes (famous last words).  And we use the node hierarchy in
our queries --- well, always!

The balance between use cases suggests to me the need for administratively
defined indexes that the query processing engine could use if they exist.
Users (repository administrators) could then define such indexes - like
this one - knowing that certain operations (a move) would require a fairly
expensive rebuild of the indexing structures.

Could you give some more detail on how ChildAxisQuery and
DescendantSelfAxisQuery work?  On first read, the comments at the beginning
were not completely clear to me - more than likely my fault.  I think I
get what they are doing; I would just like a little more overview to help
me before jumping in and attempting to understand the code.  It seems that
the query parser breaks down a path into its pieces, and this is then fed
into the LuceneQueryParser as location steps - and these get changed into
ChildAxisQuery or DescendantSelfAxisQuery as appropriate?

-Dave

On 3/13/07, Marcel Reutegger <[EMAIL PROTECTED]> wrote:

well, the problem with that approach is the following:

assume you have a tree of nodes under /a, let's say 10 million nodes. then a
user renames /a to /b. the index would have to re-index 10 million nodes. this
operation is currently very efficient and takes just a couple of milliseconds,
because the nodes in the index are just linked with a parent uuid. renaming a
node simply means an update of one node (document) in the index.
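
To make the trade-off concrete, here is a minimal sketch of the parent-link
model described above. It is not Jackrabbit code: the class, the in-memory
maps standing in for the index, and all method names are assumptions made up
for illustration. It only shows why a rename touches a single entry, while
resolving a path or an ancestor relationship has to walk the parent chain.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an index where each node stores only its own
// name and the id of its parent (no full path).
public class ParentLinkSketch {
    private final Map<String, String> parentOf = new HashMap<>();
    private final Map<String, String> nameOf = new HashMap<>();

    // parentId is null for the root node.
    public void addNode(String id, String parentId, String name) {
        parentOf.put(id, parentId);
        nameOf.put(id, name);
    }

    // Renaming updates exactly one entry, however many descendants exist.
    public void rename(String id, String newName) {
        nameOf.put(id, newName);
    }

    // Hierarchy checks pay the cost instead: walk up the parent links.
    public boolean isDescendantOrSelf(String id, String ancestorId) {
        for (String cur = id; cur != null; cur = parentOf.get(cur)) {
            if (cur.equals(ancestorId)) {
                return true;
            }
        }
        return false;
    }

    // The path is derived on demand from the parent chain.
    public String pathOf(String id) {
        StringBuilder sb = new StringBuilder();
        String cur = id;
        while (cur != null && parentOf.get(cur) != null) {
            sb.insert(0, "/" + nameOf.get(cur));
            cur = parentOf.get(cur);
        }
        return sb.length() == 0 ? "/" : sb.toString();
    }
}
```

With a stored-path scheme the costs would be reversed: path matching becomes
a direct lookup, but rename() would have to rewrite every descendant.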

but I agree with both of you that there is a lot of potential in optimizing
path/hierarchy resolution in the lucene query handler in jackrabbit. some
optimization is already done by caching the child->parent link information.
e.g. see:

http://svn.apache.org/repos/asf/jackrabbit/tags/1.2.3/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/CachingIndexReader.java
(-> the field called 'parents')

That's in the end what the ChildAxisQuery and DescendantSelfAxisQuery use.

regards
  marcel

Michael Neale wrote:
> Yeah I would +1 to that, it's something I do fairly often (there is often a
> lot of info in a path that is relevant to a query - given that we have gone
> ahead and nicely partitioned our content !).
>
> On 3/13/07, David Johnson <[EMAIL PROTECTED]> wrote:
>>
>> As another example, for each node, perhaps every potential parent path
>> could be added to the index - as an example a node at /a/b/c/d/e/f/g would
>> have index entries:
>>
>> path1: /a
>> path2: /a/b
>> path3: /a/b/c
>> path4: /a/b/c/d
>> path5: /a/b/c/d/e
>> path6: /a/b/c/d/e/f
>>
>> so queries for specific sub-paths - e.g., select * from my:type where
>> jcr:path like '/a/b/c/%' - could be mapped to a direct lucene match query,
>> i.e.,
>> path3 = /a/b/c
>>
>> The index entry to use for the Lucene query could be determined easily by
>> simple parsing of the path specified in the query.
>>
>> Perhaps something like this is already in the code.  Are ChildAxisQuery
>> and DescendantSelfAxisQuery currently used for cases like this?
>>
>> -Dave
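
A rough sketch of the ancestor-path idea quoted above, in case it helps the
discussion. Nothing here exists in Jackrabbit or Lucene: the class and method
names are hypothetical, and the "pathN" field naming simply follows the
example in the quoted mail. It shows how the index entries for a node could
be generated, and how a "jcr:path like '/a/b/c/%'" pattern could be turned
into a single field/value pair by parsing the path.

```java
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
public class AncestorPathFields {

    // For /a/b/c/d/e/f/g this yields path1=/a ... path6=/a/b/c/d/e/f.
    public static Map<String, String> fieldsFor(String nodePath) {
        Map<String, String> fields = new LinkedHashMap<>();
        // segments[0] is empty because the path starts with "/";
        // the last segment is the node itself, so it is skipped.
        String[] segments = nodePath.split("/");
        StringBuilder prefix = new StringBuilder();
        for (int depth = 1; depth < segments.length - 1; depth++) {
            prefix.append('/').append(segments[depth]);
            fields.put("path" + depth, prefix.toString());
        }
        return fields;
    }

    // Maps a pattern such as '/a/b/c/%' to the entry (path3, /a/b/c).
    // Only the simple "<absolute path>/%" form is handled in this sketch.
    public static Map.Entry<String, String> termFor(String likePattern) {
        if (!likePattern.endsWith("/%") || likePattern.length() <= 2) {
            throw new IllegalArgumentException("unsupported pattern: " + likePattern);
        }
        String prefix = likePattern.substring(0, likePattern.length() - 2);
        int depth = prefix.split("/").length - 1;
        return new AbstractMap.SimpleImmutableEntry<>("path" + depth, prefix);
    }
}
```

The resulting pair would then be used as an exact-match term in the query.
The cost is the one Marcel points out: every stored prefix below a renamed or
moved node has to be rewritten.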
