David Johnson wrote:

Out of the Jackrabbit code,
DescendantSelfAxisQuery.DescendantSelfAxisScorer.next()
is now taking the most time while executing my query suite, accounting for
68% of the time. Within it, calls to
DescendantSelfAxisQuery.DescendantSelfAxisScorer.calculateSubHits() take
the majority of the time (basically all of it), and within those, calls to
BooleanScorer2.score(HitCollector) - back in Lucene code - take the
majority of the time. If more specific profiling data is desired, please
feel free to ask. I can also share the profile data in the form of a
NetBeans profiler snapshot.

To my understanding, calculateSubHits() can be divided into two parts:

- The first part queries all nodes that are directly addressed by your XPath (for /foo/bar//* this will be /foo/bar[1], /foo/bar[2], ...). This query is quite fast in my experience.
- The second part does the actual work, i.e. the Lucene query on the node attributes. I don't think there is much potential for improvement here unless you dig into Lucene itself.

Contrast this with DescendantSelfAxisScorer.next(). This method takes the result of part two (subHits) and filters out all nodes that are neither part of the result of part one (contextHits) nor a descendant of one of the nodes in contextHits. To filter these nodes, a lot of parent-child relations have to be resolved. I think there is some caching potential for contextHits here if you use the same basis, like /foo/bar//*, for a lot of queries. But such a cache would only be valid for a particular IndexReader, that is to say it would only be beneficial if your repository is quite stable.
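The filtering described above could be sketched roughly like this. This is a minimal, self-contained illustration, not Jackrabbit's actual code: the names DescendantFilter, subHits, contextHits and parentOf are my own, and the parent lookup is reduced to a plain function returning -1 at the root. The inner walk up the parent chain is exactly the per-node resolution cost that a contextHits cache would help amortize.

```java
import java.util.BitSet;
import java.util.function.IntUnaryOperator;

public class DescendantFilter {
    // Keep a sub-hit if it is itself a context hit or has an ancestor that
    // is one, walking parent links until the root (parentOf returns -1).
    static BitSet filter(BitSet subHits, BitSet contextHits, IntUnaryOperator parentOf) {
        BitSet result = new BitSet();
        for (int doc = subHits.nextSetBit(0); doc >= 0; doc = subHits.nextSetBit(doc + 1)) {
            for (int n = doc; n >= 0; n = parentOf.applyAsInt(n)) {
                if (contextHits.get(n)) {
                    result.set(doc);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy tree: parent[child]; -1 means root.
        int[] parent = {-1, 0, 1, 1, 0};
        BitSet context = new BitSet();
        context.set(1);                  // say /foo/bar matched doc 1
        BitSet sub = new BitSet();
        sub.set(2);                      // under doc 1 -> kept
        sub.set(4);                      // under the root only -> dropped
        System.out.println(filter(sub, context, n -> parent[n])); // prints {2}
    }
}
```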

I was digging a bit into Jackrabbit today and found another place where some caching provides a substantial performance gain for queries that check one attribute for more than one value (like /foo/*[@foo:bar='john' or @foo:bar='doe']). The BitSet in calculateDocFilter() is currently created twice for the query above. On large repositories this takes about 200ms per BitSet on my machine for a particular field. Caching these BitSets per IndexReader and field in a WeakHashMap, with the IndexReader as the key, gave me some real improvements. But this caching, too, is only beneficial for repositories that are not changing heavily, as changes lead to the creation of new IndexReaders and invalidate the cache.
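The per-reader, per-field cache could look roughly like this. Again a minimal sketch of the idea, not the actual patch: DocFilterCache is my own name, and a plain Object stands in for the IndexReader key. Because the reader is only weakly held, reopening the index after a change drops the stale entries automatically.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Supplier;

public class DocFilterCache {
    // Outer key: the IndexReader instance (weakly held, so a reopened index
    // lets its old entries be garbage-collected); inner key: the field name.
    private final Map<Object, Map<String, BitSet>> cache = new WeakHashMap<>();

    public synchronized BitSet get(Object reader, String field, Supplier<BitSet> compute) {
        return cache.computeIfAbsent(reader, r -> new HashMap<>())
                    .computeIfAbsent(field, f -> compute.get());
    }

    public static void main(String[] args) {
        DocFilterCache cache = new DocFilterCache();
        Object reader = new Object();        // stands in for an IndexReader
        int[] built = {0};
        Supplier<BitSet> expensive = () -> { // the ~200ms BitSet construction
            built[0]++;
            return new BitSet(1024);
        };
        // Two value tests on the same field reuse one BitSet
        // instead of building it twice.
        BitSet first = cache.get(reader, "foo:bar", expensive);
        BitSet second = cache.get(reader, "foo:bar", expensive);
        System.out.println(first == second); // prints true
        System.out.println(built[0]);        // prints 1
    }
}
```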

As both of the caches mentioned rely heavily on IndexReader reuse, it would probably be better to have caches per index segment, as someone mentioned in the thread about using Lucene filters, since segments are relatively stable.

That's what I've found out so far. I'll do some more research over the next few days, as we definitely need to improve query performance for our application.

I would like to hear some comments from the Jackrabbit gurus - and feel free to correct me, I just started ;)

Cheers,
Christoph
