On 3/6/07, Marcel Reutegger <[EMAIL PROTECTED]> wrote:
Hi David,

David Johnson wrote:
> Yes, I am using Jackrabbit 1.2.x and I am not seeing that dramatic of a
> difference between 1.1.x and the 1.2.x, although I have not done a direct
> comparison between the two with the same query suite.

please note that you have to change the configuration to get the
performance improvement mentioned by jukka. See:
http://issues.apache.org/jira/browse/JCR-651
In my last tests, I think I have done this - through parameters in the
repository.xml file and by recreating the entire repository. Nevertheless,
I did not see a significant change in query response time. Perhaps I
wasn't using a small enough resultFetchSize (128)? I also set
respectDocumentOrder to false. Other suggestions? I am also using the
1.2.3 code base with the bundle persistence manager patch. Interestingly,
I did see a 15% speed improvement from the bundle persistence manager
patch alone.
> It looks like adding ordering and/or large range queries can
> significantly impact the query time.

I've also noticed this performance hit, but so far haven't found a way to
improve the situation. For interested readers, here are the details:
sorting on custom fields is IMO not very well integrated into the Lucene
core because it was added in a later version (1.4); in addition, Lucene
primarily focuses on, and is optimized for, sorting by relevance. Lucene
is also not well suited for incremental index updates: it is optimized for
large batch updates and assumes that an index reader is kept open for a
long time. To accommodate this limitation, Jackrabbit uses its own index
segments to keep index readers open as long as possible. This allows
Lucene to use its internal caches and speeds up index updates.
In my tests so far, I am not making any changes to the repository while
running the queries. Could this be considered a best-case scenario - i.e.,
the Lucene indexes are not being updated? What would be the expected
performance change if I have ongoing updates while querying the system?

In certain cases, however, Lucene caches are bound to the top-level multi
reader that spans the Jackrabbit index segments. This is unfortunate
because this multi reader changes whenever the index is updated and, as an
effect, invalidates caches that have a dependency on this multi reader.
One of them is the ScoreDocComparator created in FieldSortedHitQueue,
which is cached using a dependency on the top-level multi reader. A
ScoreDocComparator for relevance sorting is trivial and lightweight, but a
ScoreDocComparator on a field is quite heavyweight and is therefore cached
in Lucene. Some time ago I thought Jackrabbit could use one
ScoreDocComparator per index segment and keep the comparators on index
segments, but that requires an even more heavyweight version of the
comparator that keeps the values of the index terms.

> I would be very interested in working on and/or looking into
> optimization strategies and/or experiments. While I can puzzle out the
> code, and the structure of the query syntax tree, any
> pointers/documentation would be very much appreciated.

Any help is very much appreciated. Currently there is only one document
that gives an overview of the query implementation [1], plus the javadoc
in the sources, of course.
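A minimal sketch of the caching problem described above: a comparator
cache keyed on the identity of the top-level reader, so that every new
multi reader starts cold. Class and method names here are illustrative
stand-ins, not the actual Lucene/Jackrabbit API:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class ComparatorCacheSketch {
    // cache keyed by reader identity; entries die together with the reader
    private final Map<Object, Map<String, long[]>> cache = new WeakHashMap<>();

    /** Load (or reuse) the per-document field values needed for sorting. */
    long[] comparatorFor(Object reader, String field, long[] fieldValues) {
        return cache
            .computeIfAbsent(reader, r -> new java.util.HashMap<>())
            .computeIfAbsent(field, f -> fieldValues.clone()); // heavy: one slot per doc
    }

    boolean isCached(Object reader, String field) {
        Map<String, long[]> perReader = cache.get(reader);
        return perReader != null && perReader.containsKey(field);
    }

    public static void main(String[] args) {
        ComparatorCacheSketch c = new ComparatorCacheSketch();
        Object reader1 = new Object();            // first top-level multi reader
        c.comparatorFor(reader1, "myDate", new long[] {3, 1, 2});
        if (!c.isCached(reader1, "myDate")) throw new AssertionError();

        Object reader2 = new Object();            // index updated: new multi reader
        if (c.isCached(reader2, "myDate")) throw new AssertionError(); // cache is cold again
    }
}
```

The point of the sketch: because the cache key is the reader itself, every
index update (a new multi reader) throws away the expensive per-field
comparator, which is exactly the cost Marcel describes.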
So far, I am just coming up with ideas and, if the implementation is
simple enough, giving them a try. No good news to report yet, although see
below for additional information on experiments with filters.

If you have any questions, feel free to ask on the dev list.
> In the case of an order by query and large range queries, it looks like
> significant time is spent in the
> org.apache.jackrabbit.core.query.lucene.SharedFieldSortComparator class,
> and that the result set is being walked in order to make sure that each
> result matches the query parameters, and probably for sorting purposes.
> In the best case it would be good to have a pre-sorted index so that
> sorted results could be returned much more quickly. Similarly, the
> ability to define indexes on node properties could be used to speed up
> results. Both of these types of indexes could be modeled and probably
> implemented similarly to well-known database indexing schemes.

Jackrabbit already indexes properties using Lucene. That is, the Lucene
query handler in Jackrabbit already has the ability to enumerate documents
(or nodes, respectively) in sorted order for a certain property. The
problem here, however, is that Lucene initially orders documents by
document number while it calculates the matching documents. This is
different from database systems, where the DBMS already uses pre-sorted
sets according to the order by clause. So I'm not sure how to optimize
sorting in Lucene in a DBMS way. Hmm, maybe we could use scoring, but I'm
not sure how reliably this would work... Instead of using a regular score
for how well a document matches a certain term, we could implement a score
value according to the term value and the sort order. This would much
better leverage the internal structure and workings of Lucene. Anyone
tried this before?

> Finally, there are also optimizations - more tailored to JSR-170 needs -
> that could be implemented with Lucene filters - specifically node type
> query constraints - e.g., select * from my:mixin - the filter could be
> defined to limit query searches / results to only those nodes that
> fulfill the specific node type constraint. These filters could be
> defined for each custom node type defined in the system and
> pre-calculated.
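One way to make the index's own term ordering usable as a sort order - in
the direction hinted at above - is to index numeric properties as
fixed-width, zero-padded strings, similar in spirit to Lucene's DateTools
helper. A minimal sketch, assuming non-negative values; the encoding
itself is illustrative:

```java
public class PaddedTermSketch {
    static final int WIDTH = 19; // wide enough for any non-negative long

    /** Encode a value so that lexicographic term order equals numeric order. */
    static String encode(long value) {
        String s = Long.toString(value);
        StringBuilder sb = new StringBuilder(WIDTH);
        for (int i = s.length(); i < WIDTH; i++) sb.append('0');
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        // plain string comparison puts "9" after "10"; padded terms compare numerically
        if ("9".compareTo("10") <= 0) throw new AssertionError();
        if (encode(9).compareTo(encode(10)) >= 0) throw new AssertionError();
        if (encode(1000).compareTo(encode(999)) <= 0) throw new AssertionError();
    }
}
```

With terms encoded this way, enumerating terms for a property walks the
documents in property-value order, which is what a pre-sorted index would
give you.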
Again, filters are one of the components in a Lucene query that have a
dependency on the top-level index reader. Because in Jackrabbit the index
reader changes with every update, the filter would have to be
re-calculated with every update, unless we find a way to create and use
filters per index segment. I'm not sure this would even give much of a
performance improvement, because node type constraints are basically
simple term queries, which are very fast in Lucene because they are the
building blocks of more complex queries.
So far, the experiments that I have done with Lucene filters and
Jackrabbit have been disappointing. I essentially used a QueryFilter and
passed that to the IndexSearcher in SearchIndex.executeQuery. I created
filters for one sub-part of a query - node type matching, created in
public Object visit(NodeTypeQueryNode node, Object data) of the
LuceneQueryBuilder class. Rather than adding its part of the query to the
larger Lucene query, I had the function create a QueryFilter using the
sub-part of the Lucene query that would have been created by the function,
and then return null instead of the query. The filter was then later
combined with the rest of the query in IndexSearcher. Finally, the filter
was only created once and added to a filter map, so that it could be
reused for queries against the same node type. I also noticed that a new
IndexSearcher is created for each query processed - is this breaking the
Lucene cache of filters? As a side note, all tests passed during the
normal build. In any case, performance with this modification was
significantly worse.
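The filter-per-nodetype caching described above can be sketched roughly
like this: a filter is essentially one bit per document, computed once by
a full scan and then reused until the index changes. Class and method
names are hypothetical stand-ins for the real Jackrabbit/Lucene types:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class NodeTypeFilterCache {
    private final Map<String, BitSet> cache = new HashMap<>();
    int hits = 0;

    /** docTypes[i] holds the node type of document i. */
    BitSet filterFor(String nodeType, String[] docTypes) {
        BitSet bits = cache.get(nodeType);
        if (bits != null) { hits++; return bits; }   // fast path: reuse
        bits = new BitSet(docTypes.length);          // slow path: full scan, once
        for (int i = 0; i < docTypes.length; i++)
            if (nodeType.equals(docTypes[i])) bits.set(i);
        cache.put(nodeType, bits);
        return bits;
    }

    /** Must be called whenever the index changes - the cached bits go stale. */
    void invalidate() { cache.clear(); }

    public static void main(String[] args) {
        NodeTypeFilterCache c = new NodeTypeFilterCache();
        String[] docs = { "my:mixin", "nt:unstructured", "my:mixin" };
        BitSet first = c.filterFor("my:mixin", docs);
        if (first.cardinality() != 2) throw new AssertionError();
        BitSet second = c.filterFor("my:mixin", docs);
        if (second != first || c.hits != 1) throw new AssertionError(); // reused
    }
}
```

The invalidate-on-update requirement is exactly why a per-query
IndexSearcher could defeat the cache: if the cache key involves the
searcher or its reader, a fresh searcher per query means a cache miss per
query.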
> I have not worked with Lucene Filters before, so I am not familiar with
> their speed - it might be interesting to get input from someone that has
> more experience with the run time behavior of Filters. Nevertheless, the
> Lucene docs mention Filters as a method to get around large range
> queries that inevitably break the 1024 Lucene query term limit.

Filters in Lucene are fast once you have them, but usually expensive to
calculate. Because you need to re-calculate them after you change the
index, it is important to minimize the number of times you do that. As
mentioned above, I think using filters only makes sense if we can use one
filter per index segment and only re-calculate the filters for the
youngest segment or for merged segments.
This is what I was hoping for, although I am wondering if the new IndexSearcher that is created for each query execution is eliminating the filter cache?
> Further information on the Query Syntax Tree that is used by the
> LuceneQueryBuilder - and "safe" ways to modify it (i.e., I would rather
> only modify the Query Syntax Tree and continue to use the
> LuceneQueryBuilder for most of the query processing) - would be
> appreciated.

Why do you want to modify the query tree?
I needed to "modify" the query tree so that the parts I "optimized" (i.e.,
removed) wouldn't create additional Lucene query terms. It seems that
having the visitor function return null effectively removes any terms that
may have been created by that part of the visitor in LuceneQueryBuilder.
Is this correct?

Regards,
Marcel

[1] http://jackrabbit.apache.org/doc/arch/operate/query.html
In my explorations, I have noticed that a range query, e.g., (myDate >
startDate AND myDate < endDate), seems to be translated into two Lucene
range queries that are ANDed together, which looks like ((startDate <
myDate < MAX_DATE) AND (MIN_DATE < myDate < endDate)). I am guessing that
Lucene calculates each of the two range queries in isolation and then ANDs
the results together - in a sense forcing a walk of the entire document
set. Just from the look of it, it seems inefficient, although I am not
sure how much better it would be to translate the query into a single
range query - this transformation would also require a little analysis on
the Query Syntax Tree, or a post-optimization on the Lucene query.

Finally, some ideas on indexing. It seems that there are two or perhaps
three choices that could be made for improving indexing:

1) Augment the already existing Lucene index with additional information
to speed up certain types of queries, e.g., range queries. For example, it
might be possible to index each of the "bytes" of a date or long and
optimize the query using this additional information.

2) Create an external indexing structure for node types and fields that
would mirror similar structures in a database. Again, this data could be
used to optimize range queries as well as "sorted" results.

3) Use the database to provide the indexing structures.

For ease of development, I tend to prefer #1, although I am worried that
ultimately this is becoming a case of fitting a square peg into a round
hole - i.e., Lucene really was not developed to serve as a general-purpose
database indexing engine. I am also worried about #3, since the underlying
DB persistence engine can be changed depending on how the repository is
configured. #2 seems to also want "language" extensions so that a user or
administrator can define indexes and thereby tune the query processing of
the repository, similarly to how a database addresses these problems.
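The single-range rewrite mentioned above is a small transformation on
bounds: take the tighter bound on each side of the two ANDed half-open
ranges. A sketch with plain longs standing in for query-tree nodes:

```java
public class RangeMergeSketch {
    static final long MIN = Long.MIN_VALUE, MAX = Long.MAX_VALUE;

    /** Each range is {lowerBound, upperBound}; AND them by tightening both sides. */
    static long[] merge(long[] a, long[] b) {
        return new long[] { Math.max(a[0], b[0]), Math.min(a[1], b[1]) };
    }

    public static void main(String[] args) {
        // (startDate < x < MAX) AND (MIN < x < endDate)  =>  (startDate < x < endDate)
        long[] merged = merge(new long[] {100, MAX}, new long[] {MIN, 200});
        if (merged[0] != 100 || merged[1] != 200) throw new AssertionError();
    }
}
```

The payoff is that a single bounded range enumerates only the terms
between the two bounds, instead of each half-open range enumerating terms
out to MIN_DATE or MAX_DATE.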
Short term, I would like to see if there is any low-hanging fruit in #1
that could be used to significantly speed up several types of queries. I
would also like to create some "theoretical" benchmarks to shoot for -
i.e., what is the fastest that we could expect to access a node by UUID?
Through a range query? Matching a particular node type and value? Since
database technology is mature and well optimized, it might be a good place
to look for "the best that we could ever expect" as far as query speed
goes, and then compare Jackrabbit's results (query processing time). I am
not interested in matching DB speeds one-to-one, although I think it can
be a useful benchmark that might get us thinking about why certain
mismatches occur.

-Dave
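The "theoretical" benchmark idea could start as small as timing in-memory
lookups, e.g., a HashMap get as a stand-in for the cheapest imaginable
access-by-UUID, and then comparing repository numbers against that floor.
A rough, purely illustrative harness (a real benchmark would need warm-up,
multiple runs, and realistic data sizes):

```java
import java.util.HashMap;
import java.util.Map;

public class LookupFloorBenchmark {
    public static void main(String[] args) {
        // build a map of 100k fake UUID -> node entries
        Map<String, String> byUuid = new HashMap<>();
        for (int i = 0; i < 100_000; i++) byUuid.put("uuid-" + i, "node-" + i);

        long start = System.nanoTime();
        int found = 0;
        for (int i = 0; i < 100_000; i++)
            if (byUuid.get("uuid-" + i) != null) found++;
        long nanosPerLookup = (System.nanoTime() - start) / 100_000;

        if (found != 100_000) throw new AssertionError();
        System.out.println("~" + nanosPerLookup + " ns per in-memory lookup");
    }
}
```

Numbers from a harness like this are only a lower bound, but they give a
concrete "best we could ever expect" to set repository measurements
against.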