Re: First search is slow after updating index .. subsequent searches very fast

Mark Miller Fri, 22 Dec 2006 11:02:54 -0800

I am no expert, but as I gloss over the code this is what I see happens for a sort (sometimes the less experienced has to get it wrong before an expert will jump in with some good info <G> *hint to experts*):

The field cache caches <document : term> pairs. When you sort on a field you don't want to have to extract the content of the field as you are doing the sort. So the first time you do a sort, the FieldCache is loaded up, storing the term to sort on for each document id.

For numerics this involves loading up an array that is keyed by doc id with a value of the numeric (I didn't look at the float handling that I think I saw), i.e. Array(documentId) = termtext. The field cache for an integer sort is then 32 bits times the number of docs (termtext would be an int).

For a String sort an array is made of all of the Terms in the index (terms in the field being sorted on, I believe) and another array is made that indexes into that term array. So to see what you are sorting on for the fourth document (doc id 3, since doc ids start at 0) you would get the value of the per-document array at position 3 and use the result as the index into the Term array, i.e. Array(documentId) = index, Terms(index) = termtext.

The size of a String field cache is going to be the size of those two arrays: 32 bits x number of docs + size of all of the terms (again I think this is just the terms in that field, but I have not seen anyone say that before).
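A rough sketch of the two layouts described above, in plain Java. This is only a toy model: the class and method names (FieldCacheSketch, buildIntCache, buildStringCache) are made up for illustration and are not Lucene's FieldCache API, and it assumes every document has exactly one value in the sort field.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy model of the cache layouts described in the post; not Lucene code. */
final class FieldCacheSketch {

    /** Integer sort: one int per document, indexed by doc id. */
    static int[] buildIntCache(String[] fieldTextByDoc) {
        int[] values = new int[fieldTextByDoc.length];
        for (int doc = 0; doc < fieldTextByDoc.length; doc++) {
            values[doc] = Integer.parseInt(fieldTextByDoc[doc]);
        }
        return values;
    }

    /** String sort: the unique terms plus, per document, an index into them. */
    static final class StringCache {
        final String[] terms; // unique terms of the field, in sorted order
        final int[] order;    // order[docId] = position of that doc's term in terms

        StringCache(String[] terms, int[] order) {
            this.terms = terms;
            this.order = order;
        }

        /** Array(documentId) = index, Terms(index) = termtext. */
        String termFor(int docId) {
            return terms[order[docId]];
        }
    }

    static StringCache buildStringCache(String[] fieldTextByDoc) {
        // Collect the unique terms in sorted order.
        String[] terms = new TreeSet<>(Arrays.asList(fieldTextByDoc)).toArray(new String[0]);
        int[] order = new int[fieldTextByDoc.length];
        for (int doc = 0; doc < fieldTextByDoc.length; doc++) {
            order[doc] = Arrays.binarySearch(terms, fieldTextByDoc[doc]);
        }
        return new StringCache(terms, order);
    }

    /** Size of the per-document int array alone: 32 bits (4 bytes) per doc. */
    static long intArrayBytes(int numDocs) {
        return 4L * numDocs;
    }

    public static void main(String[] args) {
        StringCache cache = buildStringCache(new String[] {"pear", "apple", "pear", "fig"});
        System.out.println("doc 3 sorts on: " + cache.termFor(3));
        System.out.println("int array for 1,000,000 docs: " + intArrayBytes(1_000_000) + " bytes");
    }
}
```

Because sorting only has to compare the small integers in the per-document array, building these arrays once up front is what makes later sorted searches cheap, and also what makes the first one expensive.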

If you followed that horrible example of an explanation then you can see why it is all done at once when you first ask for a sort. The arrays created are stored in a WeakHashMap that is keyed by the IndexReader that was used for the search (a Searcher contains an IndexReader). So every time you open a new Searcher and do a field-sorted search it will need to cache the <document : term> pairs again (using those arrays I talk about above).
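To make the "keyed by the reader" point concrete, here is a minimal sketch using java.util.WeakHashMap. The ReaderKeyedCache class and its loader function are hypothetical stand-ins, and the model caches a single field only; it just shows why a fresh reader always misses the cache.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Toy cache keyed by reader identity; R stands in for an index reader type. */
final class ReaderKeyedCache<R> {
    // Weak keys: once a reader is no longer referenced, its arrays can be collected.
    private final Map<R, int[]> cache = new WeakHashMap<>();
    private final Function<R, int[]> loader; // the expensive "read every term" step
    private int loads = 0;

    ReaderKeyedCache(Function<R, int[]> loader) {
        this.loader = loader;
    }

    synchronized int[] get(R reader) {
        int[] values = cache.get(reader);
        if (values == null) {
            values = loader.apply(reader); // only paid on the first sorted search
            cache.put(reader, values);
            loads++;
        }
        return values;
    }

    synchronized int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        ReaderKeyedCache<Object> cache = new ReaderKeyedCache<>(r -> new int[] {3, 1, 2});
        Object oldReader = new Object();
        cache.get(oldReader);
        cache.get(oldReader);     // served from the cache
        Object newReader = new Object();
        cache.get(newReader);     // new reader, so the arrays are rebuilt
        System.out.println("loads: " + cache.loadCount());
    }
}
```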

The actual sorting appears to happen just like with relevancy score sorting... using a priority queue that is loaded as a HitCollector visits each document.
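The priority-queue step can be sketched roughly as follows, assuming the per-document sort values are already cached in an int array. TopNBySortValue is an illustrative name, and java.util.PriorityQueue is swapped in for whatever queue Lucene uses internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Keeps the best n matching docs, ordered by a cached per-document sort value. */
final class TopNBySortValue {

    /** Returns up to n doc ids, ascending by sortValues[doc], ties broken by doc id. */
    static List<Integer> topN(int[] matchingDocs, int[] sortValues, int n) {
        // The head of the queue is the worst hit currently kept (largest sort value).
        PriorityQueue<Integer> queue = new PriorityQueue<>((a, b) -> {
            int c = Integer.compare(sortValues[b], sortValues[a]);
            return c != 0 ? c : Integer.compare(b, a);
        });
        for (int doc : matchingDocs) {   // "the collector visits each document"
            queue.add(doc);
            if (queue.size() > n) {
                queue.poll();            // drop the worst hit once over capacity
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            result.add(queue.poll());    // comes out worst-first
        }
        Collections.reverse(result);     // best-first for the caller
        return result;
    }

    public static void main(String[] args) {
        int[] sortValues = {50, 10, 40, 20, 30}; // indexed by doc id
        int[] matches = {0, 1, 2, 3, 4};
        System.out.println(topN(matches, sortValues, 3));
    }
}
```

Note that the queue only ever looks up values for documents that matched the query; the cache itself still covers every document in the index.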

There may be a little more to the story -- I think I saw that Comparators used for sorting are also cached, but someone else will have to correct me... I am out of time.

The way to avoid this warm-up time (due to loading up those field caches) is to pre-warm a Searcher. When an update is made to the index, instead of just opening a new Searcher, keep using the old Searcher to serve search requests, start up a new Searcher in a different thread and perform a sorted search on it, then replace the stale Searcher with the new warmed-up Searcher.
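One way the warm-then-swap idea might look, as a sketch only. WarmingSearcherHolder, the opener and the warmer are hypothetical names; S stands in for a searcher type, and in real code the warmer would run a typical query with the same sort the application uses.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Holds the live searcher and swaps in a new one only after it has been warmed. */
final class WarmingSearcherHolder<S> {
    private final AtomicReference<S> current;
    private final Supplier<S> opener; // opens a searcher on the updated index
    private final Consumer<S> warmer; // runs a representative sorted query

    WarmingSearcherHolder(Supplier<S> opener, Consumer<S> warmer) {
        this.opener = opener;
        this.warmer = warmer;
        S first = opener.get();
        warmer.accept(first);
        this.current = new AtomicReference<>(first);
    }

    /** Searches keep using whatever this returns, even while a refresh is running. */
    S get() {
        return current.get();
    }

    /** Opens and warms a new searcher, publishes it, and returns the stale one. */
    S refresh() {
        S fresh = opener.get();
        warmer.accept(fresh);            // the slow cache load happens here
        return current.getAndSet(fresh); // caller closes the stale searcher
    }

    /** Same as refresh(), but on a separate thread so searches are not blocked. */
    Thread refreshInBackground(Consumer<S> closeStale) {
        Thread t = new Thread(() -> closeStale.accept(refresh()), "searcher-warmer");
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] opened = {0};
        WarmingSearcherHolder<String> holder = new WarmingSearcherHolder<>(
                () -> "searcher-" + (++opened[0]),
                s -> System.out.println("warming " + s));
        Thread t = holder.refreshInBackground(stale -> System.out.println("closing " + stale));
        t.join();
        System.out.println("now serving " + holder.get());
    }
}
```

A real version would also need to make sure no in-flight search is still using the stale searcher before closing it.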



- Mark

Bryan Dotzour wrote:

Thanks for that tidbit Mark.  I was just looking through the LIA book
and stumbled across this sentence under the "5.1.9 Performance effect of
sorting" section.  It says: "[When sorting by a String type] each unique
term is also cached for each document. Only the actual fields used for
sorting are cached in this manner."

In the case that I originally described, our default sorting mechanism
is an alphabetical sort on the title of each object returned in the
search.  So I take this excerpt from the book to mean that the
FieldCache has to read each title value from each document in order to
perform the sort.  That pretty much sounds like exactly what you're
saying Mark.

I guess the only question left in my mind is, does the FieldCache have
to read every value for every document in the entire index to perform
the sort, or just the values in the documents returned in the search?
My guess would be the latter although this one index seems much slower
than all of the others and the only difference is the sheer number of
items in the index.

-----Original Message-----

From: Mark Miller [mailto:[EMAIL PROTECTED]]
Sent: Thursday, December 21, 2006 2:48 PM
To: java-user@lucene.apache.org
Subject: Re: First search is slow after updating index .. subsequent
searches very fast

Since you say you are sorting on a field, the bulk of the time will be doing the sort and caching it (FieldCache). Subsequent searches use that cache to avoid paying the full sort cost again. If you were doing relevancy sorting you would not experience such a big delay.


- Mark

Bryan Dotzour wrote:

Otis thanks for your suggestion, it seems to be working pretty well!
I'm just curious if you (or anyone else) could describe what is actually
happening during that initial query that ends up taking so much time.
We have several different indexes for different types of objects and
it's only this one index that exhibits this kind of behavior.  Is it
something related to the size of the index, or the number of fields, or
how fragmented the index is?

I'm just trying to get a little better understanding of what is going on
under the covers there.  I'll spend some time with the source to see if
I can figure it out, but any tips from the experts would be much
appreciated. =)

Thanks again!
Bryan

-----Original Message-----

From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 20, 2006 4:28 PM
To: java-user@lucene.apache.org
Subject: Re: First search is slow after updating index .. subsequent
searches very fast

To populate FieldCache, the number of matches doesn't matter.  There is
no need to be skimpy there - you don't really save anything by running a
query that matches only a few docs.  Just run something that looks like
a common query.

For warming up new indices, one can also use the `dd' trick under UNIX.

Otis

----- Original Message ----
From: Bryan Dotzour <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, December 20, 2006 5:23:40 PM
Subject: RE: First search is slow after updating index .. subsequent
searches very fast

One question about this, Otis... When "warming up" the new searcher,
should the query return a lot of results, or does it matter?  Can I just
do like an ID = X query and get one document back?  Is that sufficient
or is it better to run a query that will get lots of hits?

Thanks again,
Bryan

-----Original Message-----

From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 20, 2006 3:28 PM
To: java-user@lucene.apache.org
Subject: Re: First search is slow after updating index .. subsequent
searches very fast

All sounds good.  Opening a new IndexReader can take a bit of time.  If
you use sorting of any kind other than default sorting by relevance,
this delay on the first search is also probably caused by the lazy
FieldCache population.  The cure for that is to open a new
IndexReader/Searcher before you close the old one, warm it up with a
query + sort, and then switch IndexReader/Searchers, closing the old
one.

Otis

----- Original Message ----
From: Bryan Dotzour <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, December 20, 2006 3:59:19 PM
Subject: First search is slow after updating index .. subsequent
searches very fast

I'm investigating some performance issues with the way we're using
Lucene in our web app and am interested if anyone could shed some light
on what might be going on.  Hopefully I can provide enough information,
please let me know if there's more I can give.

We're using Lucene 2.0.0 and I'm currently working with disk-based
indexing (although in production I'll want to be using RAM indexing).
In our environment, we build up our Lucene index at application start up
time and then we optimize the index.  From then on, updates and deletes
to the index occur fairly frequently but we don't optimize until the
middle of the night when the impact would be at its minimum.  After a
while, what I see is that searches will be very fast (~400 ms) until I
make a modification that will require a single document to be
re-indexed.  Immediately after that has occurred, the next search will
take substantially longer (sometimes up to ~25s).  After that search has
run, the next search will be back at the ~400ms time.

Our algorithm for handling the updates is as follows:

1.    open an IndexReader on the directory
2.    delete the document using the reader
3.    close the reader
4.    open an IndexWriter
5.    add the new document using the writer
6.    close the writer

For searches:

1.    We cache off an IndexReader for the index, as well as an
IndexSearcher, which uses that reader
2.    When a search is initiated we check to see if the version of the
index has changed using getCurrentVersion()
3.    If it has changed, we close our IndexSearcher, close the
IndexReader and re-open them both
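The search-side steps above amount to a check-and-reopen pattern, sketched below in plain Java. VersionCheckedSearcher is a hypothetical name, the LongSupplier stands in for the getCurrentVersion() call, and the Supplier stands in for re-opening the reader and searcher. The point to notice is that the reopen happens inside the search path, so the first search after an update pays for it.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Reopens the searcher lazily whenever the index version has changed. */
final class VersionCheckedSearcher<S> {
    private final LongSupplier currentVersion; // stand-in for getCurrentVersion()
    private final Supplier<S> opener;          // stand-in for opening reader + searcher
    private long openedVersion;
    private S searcher;

    VersionCheckedSearcher(LongSupplier currentVersion, Supplier<S> opener) {
        this.currentVersion = currentVersion;
        this.opener = opener;
    }

    /** Called at the start of every search. */
    synchronized S acquire() {
        long version = currentVersion.getAsLong();
        if (searcher == null || version != openedVersion) {
            // Real code would close the old searcher and reader here first.
            searcher = opener.get(); // a brand-new reader has no cached sort arrays yet
            openedVersion = version;
        }
        return searcher;
    }

    public static void main(String[] args) {
        long[] version = {1L};
        int[] opens = {0};
        VersionCheckedSearcher<String> holder = new VersionCheckedSearcher<>(
                () -> version[0], () -> "searcher-" + (++opens[0]));
        System.out.println(holder.acquire()); // opens the first searcher
        version[0] = 2L;                      // simulate a document being re-indexed
        System.out.println(holder.acquire()); // version changed, so it reopens
    }
}
```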

Anything sound non-standard in that workflow?  Does anyone have an idea
of what might be happening during that slow down?

Thanks for your time,

Bryan

(For a little more info, here is a very common stack trace snippet that
I gather when the "slow search" is running.  It seems much of the time
is spent in MultiReader or MultiTermDocs)

      org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:214)
      org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:64)
      org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:33)
      org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:56)
      org.apache.lucene.index.TermBuffer.read(TermBuffer.java:62)
      org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:117)
      org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:148)
      org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:157)
      org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:151)
      org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:50)
      org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:392)
      org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:348)
      org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)
      org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)
      org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)
      org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:171)
      org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:153)
      org.apache.lucene.search.FieldCacheImpl.getAuto(FieldCacheImpl.java:349)
      org.apache.lucene.search.FieldSortedHitQueue.comparatorAuto(FieldSortedHitQueue.java:346)
      org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:189)
      org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:58)
      org.apache.lucene.search.TopFieldDocCollector.<init>(TopFieldDocCollector.java:40)
      org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:108)
      org.apache.lucene.search.Hits.getMoreDocs(Hits.java:65)
      org.apache.lucene.search.Hits.<init>(Hits.java:52)
      org.apache.lucene.search.Searcher.search(Searcher.java:53)





---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

