-- snip --
yes, this should ensure that caching in lucene is used wherever possible,
even though there might be bugs that prevent this. Just like this one:
http://svn.apache.org/viewvc?view=rev&revision=506908
which prevented the re-use of SharedFieldSortComparator even if nothing
changed between two query execution calls. You might want to check if this
patch improves the situation for you.
That "patch" actually helped a fair amount - from 60 seconds to run 100
queries down to around 40 seconds. Not too bad. Any reason why this hasn't
made it into the current releases?
> What would be the expected performance change if I have ongoing updates
> while querying the system?
depending on the query the performance will be just a bit slower or
significantly slower. e.g. an order by in your query on a property that is
on nearly every node will significantly slow down the query (-> see my
previous post about lucene's ability to cache the SharedFieldSortComparator).
> So far the experiments that I have done with Lucene filters and Jackrabbit
> have been disappointing. I essentially used a QueryFilter and passed that
> to the IndexSearcher in SearchIndex.executeQuery. I created filters for one
> sub-part of a query - NodeType matching created in public Object
> visit(NodeTypeQueryNode node, Object data) of the LuceneQueryBuilder class.
> Rather than adding its part of the query to the larger Lucene query, I had
> the function create a QueryFilter using the sub-part of the Lucene query
> that would have been created by the function, and then returning null
> instead of the query. The filter was then later combined with the rest of
> the query in IndexSearcher. Finally, the filter was only created once, and
> added to a filter map, so that it could be reused for queries against the
> same nodetype.
I still think that caching the documents that match a node type will not
help much because those are simple term queries, which are very efficient
in lucene.
Yes, it didn't seem to help that much.
> I also noticed that a new IndexSearcher was created for each query
> processed - is this breaking the Lucene cache of filters?
No, lucene uses the index reader as the key for its caches. e.g. in
QueryFilter.bits():

    synchronized (cache) { // check cache
        BitSet cached = (BitSet) cache.get(reader);
        if (cached != null) {
            return cached;
        }
    }
> This is what I was hoping for, although I am wondering if the new
> IndexSearcher that is created for each query execution is eliminating the
> filter cache?
afaics all filter caches are tied to the index reader and not the searcher
instance.
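To make that concrete, here is a minimal plain-Java sketch of the pattern QueryFilter uses internally: a map keyed by the reader object, so cached bits survive across new searcher instances as long as the same index reader is reused. `ReaderKeyedCache` and the stand-in reader objects are illustrative, not Lucene's actual classes.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of a reader-keyed cache in the style of Lucene's QueryFilter:
// entries survive new searcher instances as long as the same reader
// object is reused. A plain Object stands in for an IndexReader.
public class ReaderKeyedCache {
    private final Map<Object, BitSet> cache = new WeakHashMap<Object, BitSet>();

    public synchronized BitSet bits(Object reader) {
        BitSet cached = cache.get(reader);
        if (cached != null) {
            return cached;            // hit: same reader, same bits
        }
        BitSet fresh = new BitSet();  // in Lucene this would be computed
                                      // by running the filter's query
        cache.put(reader, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        ReaderKeyedCache c = new ReaderKeyedCache();
        Object reader = new Object();          // stands in for an IndexReader
        BitSet first = c.bits(reader);
        BitSet second = c.bits(reader);        // new "searcher", same reader
        System.out.println(first == second);   // true: cache keyed by reader
        System.out.println(c.bits(new Object()) == first); // false: new reader
    }
}
```

The point is that creating a new IndexSearcher per query is harmless for the cache; only opening a new IndexReader invalidates it.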
> I needed to "modify" the query tree so that the parts I "optimized" (i.e.,
> removed) wouldn't create additional Lucene query terms. It seems that
> having the visitor function return null effectively removes any terms that
> may have been created by that part of the visitor - in LuceneQueryBuilder.
> Is this correct?
I think writing your own customized LuceneQueryBuilder is easier than
modifying the query tree.
Ok, I will give that a try.
> In my explorations, I have noticed that a range query e.g., (myDate >
> startDate AND myDate < endDate) seems to be translated into two Lucene
> range queries that are ANDed together that look like - ((startDate <
> myDate < MAX_DATE) AND (MIN_DATE < myDate < endDate)). I am guessing that
> Lucene calculates each of the two range queries in isolation, and then
> ANDs the results together - in a sense forcing the walk of the entire
> document set.
that's correct.
> Just from the look of it, it seems inefficient, although I am not sure how
> much better it would be to translate the query to a single range query -
> this transformation would also require a little analysis on the Query
> Syntax Tree - or a post optimization on the Lucene query.
yes, this would be very useful and probably improve the performance
significantly for certain cases. the LuceneQueryBuilder should probably do
this optimization by analyzing the query tree.
I will see if I can puzzle out a solution to this scenario, and modify the
LuceneQueryBuilder code.
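The merge step itself is simple once the query tree analysis has identified two conjunctive range constraints on the same property: take the tighter bound on each side. A minimal sketch, using longs as stand-ins for the encoded date terms (the class and method names are made up for illustration):

```java
// Collapses (lower1 < v < upper1) AND (lower2 < v < upper2) into a single
// range by keeping the tighter bound on each side. For the query discussed
// above, (startDate < v < MAX) AND (MIN < v < endDate) becomes
// (startDate < v < endDate), so Lucene walks the term range only once.
public class RangeMerge {
    public static long[] merge(long[] a, long[] b) {
        long lower = Math.max(a[0], b[0]);  // tighter lower bound wins
        long upper = Math.min(a[1], b[1]);  // tighter upper bound wins
        return new long[] { lower, upper };
    }

    public static void main(String[] args) {
        long[] r1 = { 20070101L, Long.MAX_VALUE };  // myDate > startDate
        long[] r2 = { Long.MIN_VALUE, 20071231L };  // myDate < endDate
        long[] merged = merge(r1, r2);
        System.out.println(merged[0] + ".." + merged[1]); // 20070101..20071231
    }
}
```

The real work would be in the query tree analysis that proves both constraints refer to the same property; the arithmetic above is the easy part.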
> Finally, some ideas on indexing. It seems that there are two or perhaps
> three choices that could be made for improving indexing:
>
> 1) Augment the already existing Lucene index with additional information
> to speed certain types of queries - e.g., range queries. For example, it
> might be possible to index each of the "bytes" of a date or long and
> optimize the query using this additional information.
Can you please elaborate on how this would work?
I think I was again focusing on range queries and giving Lucene some way of
filtering out subsets of the document set, so that the whole document set
wouldn't have to be walked. For the date range query the from and to dates
would most likely share some set of most significant bytes - these bytes
could just be passed to Lucene as a direct match thereby reducing the subset
of the collection that would be walked. If the range query is fixed, this
"optimization" would be unnecessary. Nevertheless, I still wonder if there
is additional information that could be stored in Lucene to augment the
index and improve query processing.
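If dates are stored as sortable strings (as Lucene-style date encodings do), the shared most-significant bytes of the from and to dates are just their common string prefix. A small sketch of extracting that prefix; the `yyyyMMdd` encoding here is illustrative, not Jackrabbit's actual one:

```java
// Computes the common prefix of two sortable date strings. Two dates in
// the same month share a 6-character prefix, which could be handed to the
// index as a direct match to narrow the candidate terms before any range
// comparison is done.
public class DatePrefix {
    public static String sharedPrefix(String from, String to) {
        int n = Math.min(from.length(), to.length());
        int i = 0;
        while (i < n && from.charAt(i) == to.charAt(i)) {
            i++;
        }
        return from.substring(0, i);
    }

    public static void main(String[] args) {
        // two dates in the same month share the "200703" prefix
        System.out.println(sharedPrefix("20070301", "20070328")); // 200703
    }
}
```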
> 2) Create an external indexing structure for nodetypes and fields that
> would mirror similar structures in a database. Again this data could be
> used to optimize range queries as well as "sorted" results.
I'm not sure how you would connect those two structures because lucene uses
ephemeral document numbers:
"IndexReader: For efficiency, in this API documents are often referred to
via document numbers, non-negative integers which each name a unique
document in the index. These document numbers are ephemeral--they may
change as documents are added to and deleted from an index. Clients should
thus not rely on a given document having the same number between sessions."
In this case I was considering using the node UUID as the cross-index join
parameter. Still, there is the problem of combining the results from two
different indexes.
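Under that assumption, the combination step is a set intersection keyed by the stable UUID rather than by Lucene's ephemeral document numbers. A minimal sketch (the UUID values are made up for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Joins result sets from two indexes by node UUID: each index returns the
// UUIDs of its matching nodes, and the intersection gives the nodes that
// satisfy both parts of the query.
public class UuidJoin {
    public static Set<String> join(Set<String> fromLucene, Set<String> fromExternal) {
        Set<String> result = new HashSet<String>(fromLucene);
        result.retainAll(fromExternal);  // keep only UUIDs present in both
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<String>(Arrays.asList("uuid-1", "uuid-2"));
        Set<String> b = new HashSet<String>(Arrays.asList("uuid-2", "uuid-3"));
        System.out.println(join(a, b)); // [uuid-2]
    }
}
```

The open question is cost: materializing full UUID sets from each index before intersecting gives up the lazy, doc-number-ordered evaluation Lucene does internally.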
> 3) Use the database to provide the indexing structures.
To me this seems to be a very interesting option, though it requires
considerable effort.
Yes, I agree, this is an interesting option, and does seem that it would
take a fair amount of effort. Your comments on the user list to this same
thread seem like a start to the thought process needed. I am not very
familiar with the details of the PM, although I do think that bringing
together data storage and indexing will help with improving query processing
speed, as well as help with some data integrity issues that have been
discussed in other threads.
Over the weekend, I will see if I can come up with a solution to the range
query issue discussed above.
-Dave