On 3/6/07, Marcel Reutegger <[EMAIL PROTECTED]> wrote:
Hi David,

David Johnson wrote:
> Yes, I am using Jackrabbit 1.2.x and I am not seeing that dramatic of a
> difference between 1.1.x and the 1.2.x, although I have not done a direct
> comparison between the two with the same query suite.

please note that you have to change the configuration to get the
performance improvement mentioned by jukka. See:
http://issues.apache.org/jira/browse/JCR-651
In my last tests, I think I have done this - through parameters in the
repository.xml file and by recreating the entire repository. Nevertheless,
I did not see a significant change in query response time. Perhaps I
wasn't using a small enough resultFetchSize (128)? I also set
respectDocumentOrder to false. Other suggestions? I am also using the
1.2.3 code base with the bundle persistence manager patch. Interestingly,
I did see a 15% speed improvement from the bundle persistence manager
patch alone.
> It looks like adding ordering and/or large range queries can
> significantly impact the query time.

I've also noticed this performance hit, but so far haven't found a way to
improve the situation. For interested readers, here are the details:
sorting on custom fields is IMO not very well integrated into the Lucene
core because it was added in a later version (1.4); in addition, Lucene
primarily focuses on, and is optimized for, sorting by relevance. Lucene
is also not well suited for incremental index updates: it is optimized for
large batch updates and assumes that an index reader is kept open for a
long time. To accommodate this limitation, Jackrabbit uses its own index
segments to keep index readers open as long as possible. This allows
Lucene to use its internal caches and speeds up index updates.
In my tests so far, I am not making any changes to the repository while
running the queries. Could this be considered a best-case scenario - i.e.,
the Lucene indexes are not being updated? What would be the expected
performance change if I have ongoing updates while querying the system?

In certain cases, however, Lucene caches are bound to the top-level multi
reader that spans the Jackrabbit index segments. This is unfortunate
because this multi reader changes whenever the index is updated and, as an
effect, invalidates caches that have a dependency on this multi reader.
One of them is the ScoreDocComparator created in FieldSortedHitQueue,
which is cached using a dependency on the top-level multi reader. A
ScoreDocComparator for relevance sorting is trivial and lightweight, but a
ScoreDocComparator on a field is quite heavyweight and is therefore cached
in Lucene. Some time ago I thought Jackrabbit could use one
ScoreDocComparator per index segment and keep the comparators on index
segments, but that requires an even more heavyweight version of the
comparator that keeps the values of the index terms.

> I would be very interested in working on and/or looking into
> optimization strategies and/or experiments. While I can puzzle out the
> code, and the structure of the query syntax tree, any
> pointers/documentation would be very much appreciated.

Any help is very much appreciated. Currently there is only one document
that gives an overview of the query implementation [1], plus the javadoc
in the sources, of course.
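A minimal sketch of the caching problem described above: a comparator
cache keyed on the identity of the top-level reader, so that every new
multi reader starts cold. Class and method names here are illustrative
stand-ins, not the actual Lucene/Jackrabbit API:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class ComparatorCacheSketch {
    // cache keyed by reader identity; entries die together with the reader
    private final Map<Object, Map<String, long[]>> cache = new WeakHashMap<>();

    /** Load (or reuse) the per-document field values needed for sorting. */
    long[] comparatorFor(Object reader, String field, long[] fieldValues) {
        return cache
            .computeIfAbsent(reader, r -> new java.util.HashMap<>())
            .computeIfAbsent(field, f -> fieldValues.clone()); // heavy: one slot per doc
    }

    boolean isCached(Object reader, String field) {
        Map<String, long[]> perReader = cache.get(reader);
        return perReader != null && perReader.containsKey(field);
    }

    public static void main(String[] args) {
        ComparatorCacheSketch c = new ComparatorCacheSketch();
        Object reader1 = new Object();            // first top-level multi reader
        c.comparatorFor(reader1, "myDate", new long[] {3, 1, 2});
        if (!c.isCached(reader1, "myDate")) throw new AssertionError();

        Object reader2 = new Object();            // index updated: new multi reader
        if (c.isCached(reader2, "myDate")) throw new AssertionError(); // cache is cold again
    }
}
```

The point of the sketch: because the cache key is the reader itself, every
index update (a new multi reader) throws away the expensive per-field
comparator, which is exactly the cost Marcel describes.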
So far, I am just coming up with ideas and, if the implementation is
simple enough, giving them a try. No good news to report yet, although see
below for additional information on experiments with filters.

If you have any questions, feel free to ask on the dev list.
> In the case of an order by query and large range queries, it looks like
> significant time is spent in the
> org.apache.jackrabbit.core.query.lucene.SharedFieldSortComparator class,
> and that the result set is being walked in order to make sure that each
> result matches the query parameters, and probably for sorting purposes.
> In the best case it would be good to have a pre-sorted index so that
> sorted results could be returned much more quickly. Similarly, the
> ability to define indexes on node properties could be used to speed up
> results. Both of these types of indexes could be modeled and probably
> implemented similarly to well-known database indexing schemes.

Jackrabbit already indexes properties using Lucene. That is, the Lucene
query handler in Jackrabbit already has the ability to enumerate documents
(or nodes, respectively) in sorted order for a certain property. The
problem here, however, is that Lucene initially orders documents by
document number while it calculates the matching documents. This is
different from database systems, where the DBMS already uses pre-sorted
sets according to the order by clause. So I'm not sure how to optimize
sorting in Lucene in a DBMS way. Hmm, maybe we could use scoring, but I'm
not sure how reliably this would work... Instead of using a regular score
for how well a document matches a certain term, we could implement a score
value according to the term value and the sort order. This would much
better leverage the internal structure and workings of Lucene. Anyone
tried this before?

> Finally, there are also optimizations - more tailored to JSR-170 needs -
> that could be implemented with Lucene filters - specifically node type
> query constraints - e.g., select * from my:mixin - the filter could be
> defined to limit query searches / results to only those nodes that
> fulfill the specific node type constraint. These filters could be
> defined for each custom node type defined in the system and
> pre-calculated.
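One way to make the index's own term ordering usable as a sort order - in
the direction hinted at above - is to index numeric properties as
fixed-width, zero-padded strings, similar in spirit to Lucene's DateTools
helper. A minimal sketch, assuming non-negative values; the encoding
itself is illustrative:

```java
public class PaddedTermSketch {
    static final int WIDTH = 19; // wide enough for any non-negative long

    /** Encode a value so that lexicographic term order equals numeric order. */
    static String encode(long value) {
        String s = Long.toString(value);
        StringBuilder sb = new StringBuilder(WIDTH);
        for (int i = s.length(); i < WIDTH; i++) sb.append('0');
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        // plain string comparison puts "9" after "10"; padded terms compare numerically
        if ("9".compareTo("10") <= 0) throw new AssertionError();
        if (encode(9).compareTo(encode(10)) >= 0) throw new AssertionError();
        if (encode(1000).compareTo(encode(999)) <= 0) throw new AssertionError();
    }
}
```

With terms encoded this way, enumerating terms for a property walks the
documents in property-value order, which is what a pre-sorted index would
give you.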
Again, filters are one of the components in a Lucene query that have a
dependency on the top-level index reader. Because in Jackrabbit the index
reader changes with every update, the filter would have to be
re-calculated with every update, unless we find a way to create and use
filters per index segment. I'm not sure this would even give much of a
performance improvement, because node type constraints are basically
simple term queries, which are very fast in Lucene because they are the
building blocks of more complex queries.
So far, the experiments that I have done with Lucene filters and
Jackrabbit have been disappointing. I essentially used a QueryFilter and
passed that to the IndexSearcher in SearchIndex.executeQuery. I created
filters for one sub-part of a query - node type matching, created in
public Object visit(NodeTypeQueryNode node, Object data) of the
LuceneQueryBuilder class. Rather than adding its part of the query to the
larger Lucene query, I had the function create a QueryFilter using the
sub-part of the Lucene query that would have been created by the function,
and then return null instead of the query. The filter was then later
combined with the rest of the query in IndexSearcher. Finally, the filter
was only created once and added to a filter map, so that it could be
reused for queries against the same node type. I also noticed that a new
IndexSearcher is created for each query processed - is this breaking the
Lucene cache of filters? As a side note, all tests passed during the
normal build. In any case, performance with this modification was
significantly worse.
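The filter-per-nodetype caching described above can be sketched roughly
like this: a filter is essentially one bit per document, computed once by
a full scan and then reused until the index changes. Class and method
names are hypothetical stand-ins for the real Jackrabbit/Lucene types:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class NodeTypeFilterCache {
    private final Map<String, BitSet> cache = new HashMap<>();
    int hits = 0;

    /** docTypes[i] holds the node type of document i. */
    BitSet filterFor(String nodeType, String[] docTypes) {
        BitSet bits = cache.get(nodeType);
        if (bits != null) { hits++; return bits; }   // fast path: reuse
        bits = new BitSet(docTypes.length);          // slow path: full scan, once
        for (int i = 0; i < docTypes.length; i++)
            if (nodeType.equals(docTypes[i])) bits.set(i);
        cache.put(nodeType, bits);
        return bits;
    }

    /** Must be called whenever the index changes - the cached bits go stale. */
    void invalidate() { cache.clear(); }

    public static void main(String[] args) {
        NodeTypeFilterCache c = new NodeTypeFilterCache();
        String[] docs = { "my:mixin", "nt:unstructured", "my:mixin" };
        BitSet first = c.filterFor("my:mixin", docs);
        if (first.cardinality() != 2) throw new AssertionError();
        BitSet second = c.filterFor("my:mixin", docs);
        if (second != first || c.hits != 1) throw new AssertionError(); // reused
    }
}
```

The invalidate-on-update requirement is exactly why a per-query
IndexSearcher could defeat the cache: if the cache key involves the
searcher or its reader, a fresh searcher per query means a cache miss per
query.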
> I have not worked with Lucene Filters before, so I am not familiar with
> their speed - it might be interesting to get input from someone that has
> more experience with the run time behavior of Filters. Nevertheless, the
> Lucene docs mention Filters as a method to get around large range
> queries that inevitably break the 1024 Lucene query term limit.

Filters in Lucene are fast once you have them, but usually expensive to
calculate. Because you need to re-calculate them after you change the
index, it is important to minimize the number of times you do that. As
mentioned above, I think using filters only makes sense if we can use one
filter per index segment and only re-calculate the filters for the
youngest segment or for merged segments.
This is what I was hoping for, although I am wondering if the new IndexSearcher that is created for each query execution is eliminating the filter cache?
> Further information on the Query Syntax Tree that is used by the
> LuceneQueryBuilder - and "safe" ways to modify it (i.e., I would rather
> only modify the Query Syntax Tree and continue to use the
> LuceneQueryBuilder for most of the query processing) - would be
> appreciated.

Why do you want to modify the query tree?
I needed to "modify" the query tree so that the parts I "optimized" (i.e.,
removed) wouldn't create additional Lucene query terms. It seems that
having the visitor function return null effectively removes any terms that
may have been created by that part of the visitor in LuceneQueryBuilder.
Is this correct?

Regards,
Marcel

[1] http://jackrabbit.apache.org/doc/arch/operate/query.html
In my explorations, I have noticed that a range query, e.g., (myDate >
startDate AND myDate < endDate), seems to be translated into two Lucene
range queries that are ANDed together, which looks like ((startDate <
myDate < MAX_DATE) AND (MIN_DATE < myDate < endDate)). I am guessing that
Lucene calculates each of the two range queries in isolation and then ANDs
the results together - in a sense forcing a walk of the entire document
set. Just from the look of it, it seems inefficient, although I am not
sure how much better it would be to translate the query into a single
range query - this transformation would also require a little analysis on
the Query Syntax Tree, or a post-optimization on the Lucene query.

Finally, some ideas on indexing. It seems that there are two or perhaps
three choices that could be made for improving indexing:

1) Augment the already existing Lucene index with additional information
to speed up certain types of queries, e.g., range queries. For example, it
might be possible to index each of the "bytes" of a date or long and
optimize the query using this additional information.

2) Create an external indexing structure for node types and fields that
would mirror similar structures in a database. Again, this data could be
used to optimize range queries as well as "sorted" results.

3) Use the database to provide the indexing structures.

For ease of development, I tend to prefer #1, although I am worried that
ultimately this is becoming a case of fitting a square peg into a round
hole - i.e., Lucene really was not developed to serve as a general-purpose
database indexing engine. I am also worried about #3, since the underlying
DB persistence engine can be changed depending on how the repository is
configured. #2 seems to also want "language" extensions so that a user or
administrator can define indexes and thereby tune the query processing of
the repository, similarly to how a database addresses these problems.
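The single-range rewrite mentioned above is a small transformation on
bounds: take the tighter bound on each side of the two ANDed half-open
ranges. A sketch with plain longs standing in for query-tree nodes:

```java
public class RangeMergeSketch {
    static final long MIN = Long.MIN_VALUE, MAX = Long.MAX_VALUE;

    /** Each range is {lowerBound, upperBound}; AND them by tightening both sides. */
    static long[] merge(long[] a, long[] b) {
        return new long[] { Math.max(a[0], b[0]), Math.min(a[1], b[1]) };
    }

    public static void main(String[] args) {
        // (startDate < x < MAX) AND (MIN < x < endDate)  =>  (startDate < x < endDate)
        long[] merged = merge(new long[] {100, MAX}, new long[] {MIN, 200});
        if (merged[0] != 100 || merged[1] != 200) throw new AssertionError();
    }
}
```

The payoff is that a single bounded range enumerates only the terms
between the two bounds, instead of each half-open range enumerating terms
out to MIN_DATE or MAX_DATE.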
Short term, I would like to see if there is any low-hanging fruit in #1
that could be used to significantly speed up several types of queries. I
would also like to create some "theoretical" benchmarks to shoot for -
i.e., what is the fastest that we could expect to access a node by UUID?
Through a range query? Matching a particular node type and value? Since
database technology is mature and well optimized, it might be a good place
to look for "the best that we could ever expect" as far as query speed
goes, and then compare Jackrabbit's results (query processing time). I am
not interested in matching DB speeds one-to-one, although I think it can
be a useful benchmark that might get us thinking about why certain
mismatches occur.

-Dave
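The "theoretical" benchmark idea could start as small as timing in-memory
lookups, e.g., a HashMap get as a stand-in for the cheapest imaginable
access-by-UUID, and then comparing repository numbers against that floor.
A rough, purely illustrative harness (a real benchmark would need warm-up,
multiple runs, and realistic data sizes):

```java
import java.util.HashMap;
import java.util.Map;

public class LookupFloorBenchmark {
    public static void main(String[] args) {
        // build a map of 100k fake UUID -> node entries
        Map<String, String> byUuid = new HashMap<>();
        for (int i = 0; i < 100_000; i++) byUuid.put("uuid-" + i, "node-" + i);

        long start = System.nanoTime();
        int found = 0;
        for (int i = 0; i < 100_000; i++)
            if (byUuid.get("uuid-" + i) != null) found++;
        long nanosPerLookup = (System.nanoTime() - start) / 100_000;

        if (found != 100_000) throw new AssertionError();
        System.out.println("~" + nanosPerLookup + " ns per in-memory lookup");
    }
}
```

Numbers from a harness like this are only a lower bound, but they give a
concrete "best we could ever expect" to set repository measurements
against.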