Re: java gc with a frequently changing index?

Tim Sturge Mon, 30 Jul 2007 14:04:39 -0700

Oh, yeah, I know now :-). But I really do have a requirement to show search results from items that came in 5 seconds ago. We have an application where a common usage pattern is


1. add an item
2. navigate to another item
3. search for the first item (to associate it with the second item)


and the gap between step 1 and step 3 is not very long.

Right now, I get a notification within a few hundred msec that the item has been added. I just don't see why it is hard (in theory anyway, lucene's implementation notwithstanding) to put that on the end of the index I'm currently searching. I have lots of CPU available.

Can you tell me the JIRA issue? What kind of patch would lucene devs be likely to accept (do I need to get it 100% done, or is 80% of the way interesting?)


Tim

Mark Miller wrote:

And by the way, I cannot see it ever making sense to keep reopening an index
reader every second or so. It has to be MUCH more efficient to even wait
every 2 or 4 seconds...even that is going to be pretty nasty, but you have
to allow for a bit of batching, man. You will waste so much time opening those
readers that it's not going to be real-time anyway. You are just going to be
in a world of slow.

On 7/30/07, Mark Miller <[EMAIL PROTECTED]> wrote:

I believe there is an issue in JIRA that handles reopening an IndexReader
without reopening segments that have not changed.

On 7/30/07, Tim Sturge < [EMAIL PROTECTED]> wrote:

Thanks for the reply Erick,

I believe it is the gc for four reasons:

- I've tried the "warmup" approach already and it didn't change the
situation.

- The server completely pauses for several seconds. I run jstack to find
out where the pause is, and it also pauses for several seconds before
telling me the server is doing something perfectly innocuous. If I was
stuck in some search overhead, I would expect jstack to tell me where
(and I would expect the where to be somewhere interesting and vaguely
repeatable)

- The impact is very uneven. Over 50000 queries (sequentially) I get
49500 at 3 msec, 450 at 300 msec and 50 at 3 sec or more (ouch). I
really would be much happier with a consistent 10msec (which adds up to
the same amount of time in total) or even 25msec

- "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" changes the pauses (I get
100 msec and 1 sec pauses instead, but 5x as many for slower overall
time; 1 sec is far too slow)
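
As a check on the arithmetic in the third point: the three buckets add up to roughly 433.5 sec over 50000 queries, a mean of about 8.67 msec per query, which is why a steady 10 msec would cost about the same in total. A minimal sketch of that calculation follows; the class and method names are made up for illustration, and "3 sec or more" is taken as exactly 3 sec, so the result is a lower bound.

```java
// Worked arithmetic for the latency distribution quoted above.
final class LatencyMean {
    /** Mean latency in msec over buckets of (query count, latency in msec). */
    static double meanMillis(long[] counts, double[] millis) {
        long queries = 0;
        double total = 0.0;
        for (int i = 0; i < counts.length; i++) {
            queries += counts[i];
            total += counts[i] * millis[i];
        }
        return total / queries;
    }

    public static void main(String[] args) {
        long[] counts = {49_500, 450, 50};
        double[] millis = {3, 300, 3_000};  // assumption: the slowest bucket is exactly 3 sec
        System.out.println("mean msec per query: " + meanMillis(counts, millis));
    }
}
```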

Your solution looks possible, but seems really too complex for what I am
trying to do (which is basic incremental update). What I really am
looking for is a way to avoid reopening the first segment of my FSDir. I
have a single 6G segment, and then another 20-50 segments with updates,
but they are <100M in total size. So if I could have lucene open just
the segments file and the new or changed *.del and *.cfs files (without
reopening the unchanged *.cfs files) that would be a huge win for me I
think.

It strikes me this should be possible with a thin but complex layer
between the SegmentReader and MultiReader, and perhaps a way to get
SegmentReader to update what *.del file it is using. I'm just curious
why this doesn't already exist.
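
The layer described above can be sketched without any Lucene code. The types below (Core, SegReader, reopen) are hypothetical stand-ins invented for this illustration, not Lucene classes, and the sketch assumes a segment can be identified by its name plus a generation number for its deletions file. The point it shows is the bookkeeping: a new segment costs a full load, a segment whose deletions changed gets a new cheap view over the data already in memory, and an untouched segment is reused as is.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins only; none of these are Lucene classes.
final class SegmentReuseSketch {
    /** The expensive part of a segment, loaded once and then shared. */
    static final class Core {
        final String name;
        Core(String name) { this.name = name; }
    }

    /** A cheap view: a shared core plus the generation of the deletions it applies. */
    static final class SegReader {
        final Core core;
        final long delGen;
        SegReader(Core core, long delGen) { this.core = core; this.delGen = delGen; }
    }

    int coreLoads = 0;  // counts how often the expensive load happens

    /**
     * old:     readers from the previous view, keyed by segment name.
     * current: segment name -> deletions generation, as listed by the newest commit.
     */
    Map<String, SegReader> reopen(Map<String, SegReader> old, Map<String, Long> current) {
        Map<String, SegReader> next = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            SegReader prev = old.get(e.getKey());
            long delGen = e.getValue();
            if (prev == null) {
                coreLoads++;  // new segment: full load
                next.put(e.getKey(), new SegReader(new Core(e.getKey()), delGen));
            } else if (prev.delGen != delGen) {
                // only the deletions changed: keep the loaded core, swap the deletions
                next.put(e.getKey(), new SegReader(prev.core, delGen));
            } else {
                next.put(e.getKey(), prev);  // untouched: reuse as is
            }
        }
        return next;  // segments absent from current (e.g. merged away) are dropped
    }
}
```

With one large segment and a few dozen small ones, each reopen in this model touches only the small new segments and the deletions of the large one, which is the saving the paragraph above is after.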

Tim

Erick Erickson wrote:

Why do you believe that it's the gc? I admit I just scanned your
e-mail, but I *do* know that the first search (especially sorts) on
a newly-opened IndexReader incurs a bunch of overhead. Could
that be what you're seeing?
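
One rough way to separate the two explanations is to ask the JVM how much collection time it accumulated while a query ran. The sketch below uses the standard java.lang.management beans; the class and method names are made up for illustration. Treat the numbers as an indication only: getCollectionTime() reports approximate accumulated elapsed time per collector, and for a concurrent collector not all of that time is a stop-the-world pause.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcTimeProbe {
    /** Approximate msec the JVM reports having spent in collections so far, over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();  // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs the work and returns the collection time that accumulated while it ran. */
    static long gcMillisDuring(Runnable work) {
        long before = totalGcMillis();
        work.run();
        return totalGcMillis() - before;
    }
}
```

If a 3 sec query shows close to 3 sec of collection time, gc is the likely cause; if it shows almost none, first-search overhead on the new reader is the better suspect.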

I'm not sure there is a "best practice", but I have seen two
solutions mentioned, both more complex than opening/closing
the reader.
1> open the reader in the background, fire a few "warmup" queries
at it, then switch it with the one you actually use to answer queries.

2> Use a RAMDirectory to hold your new entries for some period
of time. You'd have to do some fancy dancing to keep this straight
since you're updating documents, but it might be viable. The scheme
is something like
Open your FSDIR
Open a RAMdir.

Add all new documents to BOTH of them. When servicing a query,
look in both indexes, but you only open/close the RAMdir for
every query. Note that since, when you open a reader, it
takes a snapshot of the index, these two views will be disjoint. When
you get your results back, you'll have to do something about the
documents from the FSdir that have been replaced in the RAMdir, which
is where the fancy dancing part comes in. But I leave that as an
exercise for the reader.

Periodically, shut everything down and repeat. The point here is that
you can (probably) close/open your RAMdir with very small costs and
have the whole thing be up to date.

There'll be some coordination issues, and you'll have to cope with data
integrity if your process barfs before you've closed your FSDir....
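
The "fancy dancing" in option 2 can be modelled in a few lines. In this sketch two in-memory maps stand in for the stale FSdir snapshot and the frequently reopened RAMdir; the class, its methods, and the substring match used as a "query" are all invented for illustration and involve no Lucene API. It shows only the merge rule: a hit from the large snapshot is dropped when the same document id has a newer version in the small index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Maps of id -> text stand in for the two indexes; this is a model, not Lucene code.
final class TwoTierSearchSketch {
    private final Map<String, String> snapshot = new TreeMap<>();  // stale view of the big index
    private final Map<String, String> recent = new TreeMap<>();    // small, always current

    /** Populates the big snapshot, as if the reader had been opened some time ago. */
    void loadSnapshot(String id, String text) { snapshot.put(id, text); }

    /** Records a new or replaced document since the snapshot was taken. */
    void update(String id, String text) { recent.put(id, text); }

    /** Returns ids of matching documents, preferring the recent version of each id. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : snapshot.entrySet()) {
            if (recent.containsKey(e.getKey())) {
                continue;  // superseded: the snapshot copy is out of date
            }
            if (e.getValue().contains(term)) {
                hits.add(e.getKey());
            }
        }
        for (Map.Entry<String, String> e : recent.entrySet()) {
            if (e.getValue().contains(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

The periodic "shut everything down and repeat" step corresponds to reloading the snapshot and clearing the recent map; scoring across the two result sets is not modelled here.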

Or, you could ask whether 5 seconds is really necessary. I've seen a lot
of times when "real time" could be 5 minutes and nobody would really
complain, and other times when it really is critical. But that's between
you and your Product Manager....

Hope this helps
Erick

On 7/25/07, Tim Sturge <[EMAIL PROTECTED]> wrote:

Hi,

I am indexing a set of constantly changing documents. The change rate is
moderate (about 10 docs/sec over a 10M document collection with a 6G
total size) but I want to be right up to date (ideally within a second
but within 5 seconds is acceptable) with the index.

Right now I have code that adds new documents to the index and deletes
old ones using updateDocument() in the 2.1 IndexWriter. In order to see
the changes, I need to recreate the IndexReader/IndexSearcher every
second or so. I am not calling optimize() on the index in the writer,
and the mergeFactor is 10.
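
One way to structure the periodic recreation described above is to build the replacement off to the side and publish it with a single atomic swap, so queries never see a half-opened searcher. The sketch below is generic over the searcher type and uses only java.util.concurrent; the class name and the open/warm callbacks are hypothetical, and what "open" and "warm" do is left to the caller.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

// S is whatever object answers queries; this holder knows nothing about Lucene.
final class SwappableSearcher<S> {
    private final AtomicReference<S> current = new AtomicReference<>();

    SwappableSearcher(S initial) {
        current.set(initial);
    }

    /** The searcher that queries should use right now. */
    S get() {
        return current.get();
    }

    /**
     * Opens a replacement, runs the warm-up against it, then publishes it.
     * Returns the previous searcher so the caller can close it.
     */
    S refresh(Supplier<S> open, Consumer<S> warm) {
        S fresh = open.get();
        warm.accept(fresh);               // e.g. a few representative queries
        return current.getAndSet(fresh);  // queries switch over from here on
    }
}
```

The returned old searcher may still be serving in-flight queries, so closing it safely needs some extra care (a short delay or reference counting), which this sketch leaves out. It also does nothing about the garbage the discarded searcher leaves behind.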

The problem I am facing is that java gc is terrible at collecting the
IndexSearchers I am discarding. I usually have a 3msec query time, but I
get gc pauses of 300msec to 3 sec (I assume it is collecting the
"tenured" generation in these pauses, which is my old IndexSearcher)

I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" and
calling System.gc() right after I close the old index without much luck
(I get the pauses down to 1sec, but get 3x as many. I want < 25 msec
pauses). So my question is, should I be avoiding reloading my index in
this way? Should I keep a separate IndexReader (which only deletes old
documents) and one for new documents? Is there a standard technique for
a quickly changing index?

Thanks,

Tim


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
