I am no expert, but from skimming the code this is what I see happening for a sort (sometimes the less experienced has to get it wrong before an expert will jump in with some good info <G> *hint to experts*):

The field cache caches <document : term> pairs. When you sort on a field you don't want to have to extract the content of the field for each document while you are doing the sort. So the first time you do a sort, the FieldCache is loaded with the term to sort on for each document id.

For numerics this involves loading up an array that is keyed by doc id with a value of the numeric (I didn't look at the float handling that I think I saw), i.e. Array(documentId) = value. The field cache for an integer sort is then 32 bits times the number of docs (the value is an int).

For a String sort an array is made of all of the Terms in the index (terms in the field being sorted on, I believe) and another array is made that indexes into that term array. So to see what you are sorting on for the fourth document (doc id 3, since doc ids start at 0) you would get the value of the per-document array at position 3 and use the result as the index into the Term array, i.e. Array(documentId) = index, Terms(index) = termtext. The size of a String field cache is going to be the size of those two arrays: 32 bits x number of docs + size of all of the terms (again I think this is just the terms in that field, but I have not seen anyone say that before).
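
The two layouts described above can be sketched in plain Java. This is a toy model, not Lucene code: the class and method names are made up for illustration, the field values are passed in as a simple array indexed by doc id, and every document is assumed to have a value.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy model of the two sort-cache layouts; names are hypothetical, not Lucene's. */
final class FieldCacheSketch {

    /** Integer sort: one int per document, indexed by doc id. */
    static int[] buildIntCache(String[] fieldValueByDoc) {
        int[] values = new int[fieldValueByDoc.length];
        for (int doc = 0; doc < fieldValueByDoc.length; doc++) {
            values[doc] = Integer.parseInt(fieldValueByDoc[doc]);
        }
        return values;
    }

    /** String sort, array 1: the field's unique terms in sorted order. */
    static String[] buildTerms(String[] fieldValueByDoc) {
        return new TreeSet<>(Arrays.asList(fieldValueByDoc)).toArray(new String[0]);
    }

    /** String sort, array 2: for each doc id, the index of its term in the terms array. */
    static int[] buildOrder(String[] fieldValueByDoc, String[] terms) {
        int[] order = new int[fieldValueByDoc.length];
        for (int doc = 0; doc < fieldValueByDoc.length; doc++) {
            order[doc] = Arrays.binarySearch(terms, fieldValueByDoc[doc]);
        }
        return order;
    }

    public static void main(String[] args) {
        String[] titles = {"pear", "apple", "fig", "apple"};
        String[] terms = buildTerms(titles);
        int[] order = buildOrder(titles, terms);
        int doc = 3;                              // fourth document
        System.out.println(terms[order[doc]]);    // prints "apple"
    }
}
```

Because the terms array is sorted, two documents can be compared for a String sort by comparing their entries in the per-document array, without touching the term text.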

If you followed that horrible example of an explanation then you can see why it is all done at once when you first ask for a sort. The arrays created are stored in a WeakHashMap that is keyed by the IndexReader that was used for the search (a Searcher contains an IndexReader). So every time you open a new Searcher and do a field-sorted search it will need to cache the <document : term> pairs again (using those arrays I talk about above).
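
A rough sketch of why a new reader means a new load, assuming a cache keyed only by the reader object (the real cache also distinguishes fields and types; `PerReaderCache` and its methods are hypothetical names, and any object stands in for an IndexReader here):

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Toy per-reader cache; entries vanish once the reader key is garbage collected. */
final class PerReaderCache<V> {
    private final Map<Object, V> byReader = new WeakHashMap<>();
    private int loads = 0;

    /** Returns the cached value for this reader, loading it once on first use. */
    synchronized V get(Object reader, Function<Object, V> loader) {
        V cached = byReader.get(reader);
        if (cached == null) {
            cached = loader.apply(reader);   // the expensive full pass happens here
            byReader.put(reader, cached);
            loads++;
        }
        return cached;
    }

    /** Number of times the loader actually ran. */
    synchronized int loadCount() {
        return loads;
    }
}
```

Asking twice with the same reader runs the loader once; asking with a freshly opened reader runs it again, which is the slow first search after an update.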

The actual sorting appears to happen just like with relevancy score sorting: using a priority queue that is loaded as a HitCollector visits each document.
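
As an illustration of that idea only (this is not the Lucene implementation; `TopNByField` is a made-up name and the loop stands in for the collector callback), a bounded priority queue over cached int sort values might look like this:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class TopNByField {
    /** Returns up to n doc ids from matchingDocs, ordered by ascending cached sort value. */
    static List<Integer> topN(int[] matchingDocs, int[] sortValueByDoc, int n) {
        // Max-heap on sort value: the head is the worst doc currently kept.
        PriorityQueue<Integer> queue = new PriorityQueue<>(
            (a, b) -> Integer.compare(sortValueByDoc[b], sortValueByDoc[a]));
        for (int doc : matchingDocs) {            // one call per matching document
            queue.add(doc);
            if (queue.size() > n) {
                queue.poll();                     // drop the current worst
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            result.add(queue.poll());             // comes out worst-first
        }
        Collections.reverse(result);              // best-first for the caller
        return result;
    }
}
```

The queue never holds more than n + 1 entries, so the per-search cost depends on the number of matches, while the cache load described above depends on the size of the whole index.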

There may be a little more to the story -- I think I saw that Comparators used for sorting are also cached, but someone else will have to correct me...I am out of time.

The way to avoid this warm-up time (due to loading up those field caches) is to pre-warm a Searcher. When an update is made to the index, instead of just opening a new Searcher, keep using the old Searcher to serve search requests, start up a new Searcher in a different thread and perform a sorted search on it, then replace the stale Searcher with the new warmed-up Searcher.
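
A minimal sketch of that swap, assuming the searcher type and the open/warm/close steps are supplied by the caller (`WarmSwap` and its method names are hypothetical; in a real application `warm` would run one sorted search and `closeOld` would close the old Searcher and its IndexReader):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Holds the live searcher and swaps in a replacement only after it has been warmed. */
final class WarmSwap<S> {
    private final AtomicReference<S> live;

    WarmSwap(S initial) {
        live = new AtomicReference<>(initial);
    }

    /** Search threads call this; they never see an unwarmed searcher. */
    S current() {
        return live.get();
    }

    /** Opens a new searcher, warms it, publishes it, then hands the stale one to closeOld. */
    void refresh(Supplier<S> open, Consumer<S> warm, Consumer<S> closeOld) {
        S fresh = open.get();
        warm.accept(fresh);                  // pay the cache-loading cost here
        S stale = live.getAndSet(fresh);     // from now on searches use the warm one
        closeOld.accept(stale);
    }

    /** Same as refresh, but on a background thread so searches keep being served. */
    Thread refreshAsync(Supplier<S> open, Consumer<S> warm, Consumer<S> closeOld) {
        Thread worker = new Thread(() -> refresh(open, warm, closeOld));
        worker.start();
        return worker;
    }
}
```

One thing this sketch does not handle: a search that fetched the old searcher just before the swap may still be using it when `closeOld` runs, so a real version would need to delay the close until in-flight searches finish.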


- Mark

Bryan Dotzour wrote:
Thanks for that tidbit Mark.  I was just looking through the LIA book
and stumbled across this sentence under the "5.1.9 Performance effect of
sorting" section.  It says: "[When sorting by a String type] each unique
term is also cached for each document. Only the actual fields used for
sorting are cached in this manner."

In the case that I originally described, our default sorting mechanism
is an alphabetical sort on the title of each object returned in the
search.  So I take this excerpt from the book to mean that the
FieldCache has to read each title value from each document in order to
perform the sort.  That pretty much sounds like exactly what you're
saying Mark.

I guess the only question left in my mind is, does the FieldCache have
to read every value for every document in the entire index to perform
the sort, or just the values in the documents returned in the search?
My guess would be the latter although this one index seems much slower
than all of the others and the only difference is the sheer number of
items in the index.

-----Original Message-----
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 21, 2006 2:48 PM
To: java-user@lucene.apache.org
Subject: Re: First search is slow after updating index .. subsequent
searches very fast

Since you say you are sorting on a field, the bulk of the time will be spent doing the sort and caching it (FieldCache). Subsequent searches use that cache to avoid paying the full sort cost again. If you were doing relevancy sorting you would not experience such a big delay.

- Mark

Bryan Dotzour wrote:
Otis, thanks for your suggestion, it seems to be working pretty well!
I'm just curious if you (or anyone else) could describe what is actually
happening during that initial query that ends up taking so much time.
We have several different indexes for different types of objects and
it's only this one index that exhibits this kind of behavior.  Is it
something related to the size of the index, or the number of fields, or
how fragmented the index is?

I'm just trying to get a little better understanding of what is going on
under the covers there.  I'll spend some time with the source to see if
I can figure it out, but any tips from the experts would be much
appreciated. =)

Thanks again!
Bryan

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 20, 2006 4:28 PM
To: java-user@lucene.apache.org
Subject: Re: First search is slow after updating index .. subsequent
searches very fast

To populate FieldCache, the number of matches doesn't matter.  There is
no need to be stingy there - you don't really save anything by running a
query that matches only a few docs.  Just run something that looks like
a common query.

For warming up new indices, one can also use the `dd' trick under UNIX.

Otis

----- Original Message ----
From: Bryan Dotzour <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, December 20, 2006 5:23:40 PM
Subject: RE: First search is slow after updating index .. subsequent
searches very fast

One question about this, Otis... When "warming up" the new searcher,
should the query return a lot of results, or does it matter?  Can I just
do like an ID = X query and get one document back?  Is that sufficient
or is it better to run a query that will get lots of hits?

Thanks again,
Bryan

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 20, 2006 3:28 PM
To: java-user@lucene.apache.org
Subject: Re: First search is slow after updating index .. subsequent
searches very fast

All sounds good.  Opening a new IndexReader can take a bit of time.  If
you use sorting of any kind other than default sorting by relevance,
this delay on the first search is also probably caused by the lazy
FieldCache population.  The cure for that is to open a new
IndexReader/Searcher before you close the old one, warm it up with a
query + sort, and then switch IndexReader/Searchers, closing the old
one.

Otis

----- Original Message ----
From: Bryan Dotzour <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, December 20, 2006 3:59:19 PM
Subject: First search is slow after updating index .. subsequent
searches very fast

I'm investigating some performance issues with the way we're using
Lucene in our web app and am interested if anyone could shed some light
on what might be going on.  Hopefully I can provide enough information;
please let me know if there's more I can give.

We're using Lucene 2.0.0 and I'm currently working with disk-based
indexing (although in production I'll want to be using RAM indexing).
In our environment, we build up our Lucene index at application start-up
time and then we optimize the index.  From then on, updates and deletes
to the index occur fairly frequently but we don't optimize until the
middle of the night when the impact would be at its minimum.  After a
while, what I see is that searches will be very fast (~400 ms) until I
make a modification that will require a single document to be
re-indexed.  Immediately after that has occurred, the next search will
take substantially longer (sometimes up to ~25s).  After that search has
run, the next search will be back at the ~400ms time.

Our algorithm for handling the updates is as follows:

1.    open an IndexReader on the directory
2.    delete the document using the reader
3.    close the reader
4.    open an IndexWriter
5.    add the new document using the writer
6.    close the writer

For searches:

1.    We cache off an IndexReader for the index, as well as an
IndexSearcher, which uses that reader
2.    When a search is initiated we check to see if the version of the
index has changed using getCurrentVersion()
3.    If it has changed, we close our IndexSearcher, close the
IndexReader and re-open them both

Anything sound non-standard in that workflow?  Does anyone have an idea
of what might be happening during that slow down?

Thanks for your time,

Bryan

(For a little more info, here is a very common stack trace snippet that
I gather when the "slow search" is running.  It seems much of the time
is spent in MultiReader or MultiTermDocs)

org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:214)
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:64)
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:33)
org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:56)
org.apache.lucene.index.TermBuffer.read(TermBuffer.java:62)
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:117)
org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:148)
org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:157)
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:151)
org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:50)
org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:392)
org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:348)
org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)
org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)
org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)
org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:171)
org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:153)
org.apache.lucene.search.FieldCacheImpl.getAuto(FieldCacheImpl.java:349)
org.apache.lucene.search.FieldSortedHitQueue.comparatorAuto(FieldSortedHitQueue.java:346)
org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:189)
org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:58)
org.apache.lucene.search.TopFieldDocCollector.<init>(TopFieldDocCollector.java:40)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:108)
org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
org.apache.lucene.search.Hits.<init>(Hits.java:52)
org.apache.lucene.search.Searcher.search(Searcher.java:53)





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to