Re: java gc with a frequently changing index?

Mark Miller Mon, 30 Jul 2007 13:20:14 -0700

I believe there is an issue in JIRA that handles reopening an IndexReader
without reopening segments that have not changed.


On 7/30/07, Tim Sturge <[EMAIL PROTECTED]> wrote:
>
> Thanks for the reply Erick,
>
> I believe it is the gc for four reasons:
>
> - I've tried the "warmup" approach alredy and it didn't change the
> situation.
>
> - The server completely pauses for several seconds. I run jstack to find
> out where the pause is, and it also pauses for several seconds before
> telling me the server is doing something perfectly innocuous. If I was
> stuck in some search overhead, I would expect jstack to tell me where
> (and I would expect the where to be somewhere interesting and vaguely
> repeatable)
>
> - The impact is very uneven. Over 50000 queries (sequentially) I get
> 49500 at 3 msec, 450 at 300 msec and 50 at 3 sec or more (ouch). I
> really would be much happier with a consistent 10msec (which adds up to
> the same amount of time in total) or even 25msec
>
> - "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" changes the pauses (I get
> 100 msec and 1 sec pauses instead, but 5x as many for slower overall
> time; 1 sec is far too slow)
>
> Your solution looks possible, but seems really too complex for what I am
> trying to do (which is basic incremental update). What I really am
> looking for is a way to avoid reopening the first segment of my FSDir. I
> have a single 6G segment, and then another 20-50 segments with updates,
> but they are <100M in total size. So if I could have lucene open just
> the segments file and the new or changed *.del and *.cfs files (without
> reopening the unchanged *.cfs files) that would be a huge win for me I
> think.
>
> It strikes me this should be possible with a thin but complex layer
> between the SegmentReader and MultiReader, and perhaps a way to get
> SegmentReader to update what *.del file it is using. I'm just curious
> why this doesn't already exist.
>
> Tim
>
> Erick Erickson wrote:
> > Why do you believe that it's the gc? I admit i just scanned your
> > e-mail, but I *do* know that the first search (especially sorts) on
> > a newly-opened IndexReader incure a bunch of overhead. Could
> > that be what you're seeing?
> >
> > I'm not sure there is a "best practice", but I have seen two
> > solutions mentioned, both more complex than opening/closing
> > the reader.
> > 1> open the reader in the background, fire a few "warmup" queries
> > at it, then switch it with the one you actually use to answer queries.
> >
> > 2> Use a RAMDirectory to hold your new entries for some period
> > of time. You'd have to do some fancy dancing to keep this straight
> > since you're updating documents, but it might be viable. The scheme
> > is something like
> > Open your FSDIR
> > Open a RAMdir.
> >
> > Add all new documents to BOTH of them. When servicing a query,
> > look in both indexes, but you only open/close the RAMdir for
> > every query. Note that since, when you open a reader, it
> > takes a snapshot of the index, these two views will be disjoint. When
> you
> > get your results back, you'll have to do something about the documents
> > from the FSdir that have been replaced in the RAMdir, which is where
> > the fancy dancing part comes in. But I leave that as an exercise for
> > the reader.
> >
> > Periodically, shut everything down and repeat. The point here is that
> > you can (probably) close/open your RAMdir with very small costs and
> > have the whole thing be up to date.
> >
> > There'll be some coordination issues, and you'll have to cope with data
> > integrity if your process barfs before you've closed your FSDir....
> >
> > Or, you could ask whether 5 seconds is really necessary.I've seen a lot
> > of times when "real time" could be 5 minutes and nobody would really
> > complain, and other times when it really is critical. But that's between
> you
> > and our Product Manager....
> >
> > Hope this helps
> > Erick
> >
> > On 7/25/07, Tim Sturge <[EMAIL PROTECTED]> wrote:
> >
> >> Hi,
> >>
> >> I am indexing a set of constantly changing documents. The change rate
> is
> >> moderate (about 10 docs/sec over a 10M document collection with a 6G
> >> total size) but I want to be  right up to date (ideally within a second
> >> but within 5 seconds is acceptable) with the index.
> >>
> >> Right now I have code that adds new documents to the index and deletes
> >> old ones using updateDocument() in the 2.1 IndexWriter. In order to see
> >> the changes, I need to recreate the IndexReader/IndexSearcher every
> >> second or so. I am not calling optimize() on the index in the writer,
> >> and the mergeFactor is 10.
> >>
> >> The problem I am facing is that java gc is terrible at collecting the
> >> IndexSearchers I am discarding. I usually have a 3msec query time, but
> I
> >> get gc pauses of 300msec to 3 sec (I assume is is collecting the
> >> "tenured" generation in these pauses, which is my old IndexSearcher)
> >>
> >> I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" and
> >> calling System.gc() right after I close the old index without much luck
> >> (I get the pauses down to 1sec, but get 3x as many. I want < 25 msec
> >> pauses). So my question is, should I be avoiding reloading my index in
> >> this way? Should I keep a separate IndexReader (which only deletes old
> >> documents) and one for new documents? Is there a standard technique for
> >> a quickly changing index?
> >>
> >> Thanks,
> >>
> >> Tim
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >>
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Re: java gc with a frequently changing index?

Reply via email to