I believe there is an issue in JIRA that handles reopening an IndexReader without reopening segments that have not changed.
On 7/30/07, Tim Sturge <[EMAIL PROTECTED]> wrote: > > Thanks for the reply Erick, > > I believe it is the gc for four reasons: > > - I've tried the "warmup" approach alredy and it didn't change the > situation. > > - The server completely pauses for several seconds. I run jstack to find > out where the pause is, and it also pauses for several seconds before > telling me the server is doing something perfectly innocuous. If I was > stuck in some search overhead, I would expect jstack to tell me where > (and I would expect the where to be somewhere interesting and vaguely > repeatable) > > - The impact is very uneven. Over 50000 queries (sequentially) I get > 49500 at 3 msec, 450 at 300 msec and 50 at 3 sec or more (ouch). I > really would be much happier with a consistent 10msec (which adds up to > the same amount of time in total) or even 25msec > > - "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" changes the pauses (I get > 100 msec and 1 sec pauses instead, but 5x as many for slower overall > time; 1 sec is far too slow) > > Your solution looks possible, but seems really too complex for what I am > trying to do (which is basic incremental update). What I really am > looking for is a way to avoid reopening the first segment of my FSDir. I > have a single 6G segment, and then another 20-50 segments with updates, > but they are <100M in total size. So if I could have lucene open just > the segments file and the new or changed *.del and *.cfs files (without > reopening the unchanged *.cfs files) that would be a huge win for me I > think. > > It strikes me this should be possible with a thin but complex layer > between the SegmentReader and MultiReader, and perhaps a way to get > SegmentReader to update what *.del file it is using. I'm just curious > why this doesn't already exist. > > Tim > > Erick Erickson wrote: > > Why do you believe that it's the gc? I admit i just scanned your > > e-mail, but I *do* know that the first search (especially sorts) on > > a newly-opened IndexReader incure a bunch of overhead. Could > > that be what you're seeing? > > > > I'm not sure there is a "best practice", but I have seen two > > solutions mentioned, both more complex than opening/closing > > the reader. > > 1> open the reader in the background, fire a few "warmup" queries > > at it, then switch it with the one you actually use to answer queries. > > > > 2> Use a RAMDirectory to hold your new entries for some period > > of time. You'd have to do some fancy dancing to keep this straight > > since you're updating documents, but it might be viable. The scheme > > is something like > > Open your FSDIR > > Open a RAMdir. > > > > Add all new documents to BOTH of them. When servicing a query, > > look in both indexes, but you only open/close the RAMdir for > > every query. Note that since, when you open a reader, it > > takes a snapshot of the index, these two views will be disjoint. When > you > > get your results back, you'll have to do something about the documents > > from the FSdir that have been replaced in the RAMdir, which is where > > the fancy dancing part comes in. But I leave that as an exercise for > > the reader. > > > > Periodically, shut everything down and repeat. The point here is that > > you can (probably) close/open your RAMdir with very small costs and > > have the whole thing be up to date. > > > > There'll be some coordination issues, and you'll have to cope with data > > integrity if your process barfs before you've closed your FSDir.... > > > > Or, you could ask whether 5 seconds is really necessary.I've seen a lot > > of times when "real time" could be 5 minutes and nobody would really > > complain, and other times when it really is critical. But that's between > you > > and our Product Manager.... > > > > Hope this helps > > Erick > > > > On 7/25/07, Tim Sturge <[EMAIL PROTECTED]> wrote: > > > >> Hi, > >> > >> I am indexing a set of constantly changing documents. The change rate > is > >> moderate (about 10 docs/sec over a 10M document collection with a 6G > >> total size) but I want to be right up to date (ideally within a second > >> but within 5 seconds is acceptable) with the index. > >> > >> Right now I have code that adds new documents to the index and deletes > >> old ones using updateDocument() in the 2.1 IndexWriter. In order to see > >> the changes, I need to recreate the IndexReader/IndexSearcher every > >> second or so. I am not calling optimize() on the index in the writer, > >> and the mergeFactor is 10. > >> > >> The problem I am facing is that java gc is terrible at collecting the > >> IndexSearchers I am discarding. I usually have a 3msec query time, but > I > >> get gc pauses of 300msec to 3 sec (I assume is is collecting the > >> "tenured" generation in these pauses, which is my old IndexSearcher) > >> > >> I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" and > >> calling System.gc() right after I close the old index without much luck > >> (I get the pauses down to 1sec, but get 3x as many. I want < 25 msec > >> pauses). So my question is, should I be avoiding reloading my index in > >> this way? Should I keep a separate IndexReader (which only deletes old > >> documents) and one for new documents? Is there a standard technique for > >> a quickly changing index? > >> > >> Thanks, > >> > >> Tim > >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: [EMAIL PROTECTED] > >> For additional commands, e-mail: [EMAIL PROTECTED] > >> > >> > >> > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
