Re: Optimize search performance

Marcel Reutegger Mon, 11 Jun 2007 01:17:24 -0700

Hi Christoph,

Christoph Kiehl wrote:

I had a look at the search related code during the last days, because we need better performance for range queries on date fields as well as for sorting by date fields. These are my thoughts so far:

1. Wouldn't it make sense to exclude the index for the "jcr:system" tree (which is located at repository/index by default) if the query to execute doesn't include items from the "jcr:system" tree? Take for example a query like "my:app//element(*, foo:bar)". This query only searches for nodes located under "my:app" which excludes nodes from "jcr:system" and therefore doesn't need to search in the "jcr:system" index.


I think this is doable. Can you please file a jira issue about this?

As the "jcr:system" tree might grow quite quickly if you create a lot of versions, it might be worth excluding it. I'm not sure though how hard it would be to find out if a query needs to include the "jcr:system" index.


There are two relevant nodes in the query tree to find that out.

- what's the first location step and does it include the jcr:system tree? I think that's an easy one.
- does the query contain a jcr:deref node? If there is one, an intermediate result of the query may dereference into the jcr:system tree.
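
The two checks above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: Step, Query and needsSystemIndex are hypothetical names made up for this example and are not Jackrabbit classes. A real implementation would walk the actual query tree.

```java
import java.util.Arrays;
import java.util.List;

public class SystemIndexCheck {

    /** Hypothetical, simplified location step (not a Jackrabbit class). */
    static final class Step {
        final String name;              // null stands for a wildcard name test
        final boolean descendantOrSelf; // true for a step reached via "//"
        Step(String name, boolean descendantOrSelf) {
            this.name = name;
            this.descendantOrSelf = descendantOrSelf;
        }
    }

    /** Hypothetical, simplified query: path steps plus a jcr:deref flag. */
    static final class Query {
        final List<Step> steps;
        final boolean usesDeref;
        Query(boolean usesDeref, Step... steps) {
            this.usesDeref = usesDeref;
            this.steps = Arrays.asList(steps);
        }
    }

    /** Returns true unless the query provably stays outside jcr:system. */
    static boolean needsSystemIndex(Query q) {
        // Check 2: a dereference may lead into jcr:system from anywhere.
        if (q.usesDeref) {
            return true;
        }
        // No location step at all: be conservative and include the index.
        if (q.steps.isEmpty()) {
            return true;
        }
        // Check 1: can the first location step select jcr:system?
        Step first = q.steps.get(0);
        if (first.descendantOrSelf || first.name == null) {
            return true; // leading "//" or wildcard can match jcr:system
        }
        return "jcr:system".equals(first.name);
    }
}
```

With this sketch, "my:app//element(*, foo:bar)" has "my:app" as its first step and would skip the jcr:system index, while a query starting with "//" would still include it.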

2. Lucene uses the FieldCaches to speed up sorting and range queries which is exactly what we are after. Those FieldCaches are per IndexReader. Jackrabbit uses an IndexSearcher which searches on a single IndexReader which is most likely to be an instance of CachingMultiReader. So on every search which builds up a FieldCache this FieldCache instance is associated with this instance of a CachingMultiReader. On successive queries which operate on this CachingMultiReader you will get a tremendous speedup for queries which can reuse those associated FieldCache instances.

The problem is that Jackrabbit creates a new CachingMultiReader _every time_ one of the underlying indexes is modified. This means if you just change _one_ item in the repository you will need to rebuild all those FieldCaches because the existing FieldCaches are associated with the old instance of CachingMultiReader.

This does not only lead to slow search response times for queries which contain range queries or are sorted by a field but also leads to massive memory consumption (depending on the size of your indexes) because there might be multiple instances of CachingMultiReaders in use if you have a scenario where a lot of queries and item modifications are executed concurrently.

As far as I understand the solution is to use a MultiSearcher which uses multiple IndexReaders. Since due to the merging strategy most of the indexes are stable this means the FieldCaches can be used for a much longer time.
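
To illustrate the effect described above, here is a minimal sketch of a cache that is keyed by reader instance. It is not Lucene's FieldCache code: ReaderKeyedCache, Reader and buildOrdinals are hypothetical stand-ins that only show why a fresh reader instance forces a full rebuild even when almost all of the underlying data is unchanged.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.WeakHashMap;

public class ReaderKeyedCache {

    /** Hypothetical stand-in for an index reader; only its identity matters. */
    static final class Reader {
        final String[] values; // one field value per document
        Reader(String... values) {
            this.values = values;
        }
    }

    // Entries are keyed by reader identity (Reader does not override equals).
    private final Map<Reader, int[]> cache = new WeakHashMap<>();

    /** Counts how often the expensive rebuild had to run. */
    int builds = 0;

    /** Returns the sort ordinals for a reader, building them on a cache miss. */
    int[] ordinals(Reader reader) {
        int[] cached = cache.get(reader);
        if (cached == null) {
            builds++;
            cached = buildOrdinals(reader.values);
            cache.put(reader, cached);
        }
        return cached;
    }

    /** Expensive step: looks at every value to assign each document a rank. */
    static int[] buildOrdinals(String[] values) {
        String[] sorted = values.clone();
        Arrays.sort(sorted);
        int[] ord = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            ord[i] = Arrays.binarySearch(sorted, values[i]);
        }
        return ord;
    }
}
```

Querying the same Reader twice builds the ordinals once. Creating a new Reader after a single change, as happens with a new CachingMultiReader, triggers a second full build.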

This is all correct, but it does not work, and you actually already found out why:

I just tried to quickly modify SearchIndex to use a MultiSearcher with multiple IndexReaders wrapped by IndexSearchers but wasn't successful because somewhere in DescendantSelfAxisWeight the index readers are required to implement HierarchyResolver which ReadOnlyIndexReader doesn't.

Using a multi searcher means that you must be able to execute a query on each of the index segments independently. This is not possible because hierarchy information is always spread across multiple segments, e.g. a node in one segment may reference a parent in another segment.

There's also another reason why a multi searcher is not the best solution. It requires that the fields of a returned FieldDoc contain the values of the indexed property. If there are lots of values to order, the complete set of values needs to be read into memory. With the current implementation this is not needed because there is just a single FieldCache that uses integers instead of the real value. See class SharedFieldSortComparator [1]. The downside of this approach is that you cannot do a merge sort just using those integers.

A viable solution may be a combination of both approaches. Use a FieldCache per index segment (which allows us to cache them for a longer period) but still use integer values for ordering of nodes within a segment. Then do a merge sort with a modified SharedFieldSortComparator that reads property values from the item state manager when nodes are compared across index segments. Even though this requires reading property state, the performance shouldn't suffer too much, I think. The properties would be read anyway when the query result is iterated, so it shouldn't harm if they are read already during query execution.
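
A rough sketch of that combined approach, just to make the idea concrete. Nothing here is existing Jackrabbit code: SegmentMergeSort, Hit and the valueLookup function (standing in for a read through the item state manager) are hypothetical. It assumes each segment's hits arrive already sorted by their per-segment integer ordinal, and that ordinal order within a segment agrees with the order of the real values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Function;

public class SegmentMergeSort {

    /** Hypothetical hit: owning segment, per-segment ordinal and node id. */
    static final class Hit {
        final int segment;
        final int ordinal;
        final String nodeId;
        Hit(int segment, int ordinal, String nodeId) {
            this.segment = segment;
            this.ordinal = ordinal;
            this.nodeId = nodeId;
        }
    }

    /**
     * Merges per-segment hit lists into one globally ordered list of node ids.
     * segments.get(i) must hold the hits of segment i, sorted by ordinal.
     */
    static List<String> merge(List<List<Hit>> segments,
                              Function<String, String> valueLookup) {
        // Same segment: cheap integer comparison.
        // Different segments: ordinals are not comparable, read real values.
        Comparator<Hit> cmp = (a, b) -> a.segment == b.segment
                ? Integer.compare(a.ordinal, b.ordinal)
                : valueLookup.apply(a.nodeId).compareTo(valueLookup.apply(b.nodeId));

        // Heap of cursors {segment index, position}, ordered by current hit.
        PriorityQueue<int[]> heap = new PriorityQueue<>((x, y) -> cmp.compare(
                segments.get(x[0]).get(x[1]), segments.get(y[0]).get(y[1])));
        for (int s = 0; s < segments.size(); s++) {
            if (!segments.get(s).isEmpty()) {
                heap.add(new int[] {s, 0});
            }
        }

        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            List<Hit> segment = segments.get(cursor[0]);
            result.add(segment.get(cursor[1]).nodeId);
            if (cursor[1] + 1 < segment.size()) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return result;
    }
}
```

Real property values are only read for the hits at the head of each segment during the merge, so the number of reads is bounded by the number of comparisons across segments rather than by the size of the index.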


regards
 marcel

[1] https://svn.apache.org/repos/asf/jackrabbit/tags/1.3/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/SharedFieldSortComparator.java
