On Tue, Sep 22, 2009 at 2:53 PM, Grant Ingersoll <[email protected]> wrote:
> One of the pieces I still am missing from all of this is why isn't
> IW.getReader() now just the preferred way of getting a IndexReader
> for all applications other than those that are completely batch
> oriented?
>
> Why bother with IndexReader.reopen()?

I agree, most apps should simply use getReader, as long as they're
running in the same JVM as the IndexWriter and are holding the IW open
anyway. But the returned reader is read-only, so you can't use it to
change norms, do deletes, etc.

The API really shouldn't be marked expert. I'll go remove that...

> Lucene has, in fact, always been about incremental updates (since
> there are commercial systems out there that require complete
> re-indexing)

True, for writing. But for reading, reopening a reader was very costly
before 2.9 because the FieldCache entries had to be fully recomputed.
So, switching to per-segment search/collect in 2.9 was the biggest
step toward reducing NRT reopen latency.

> and that getting IR.reopen to perform is just a matter of tuning
> one's application in regards to reads and writes vs. having to do
> all this work in the IndexWriter that now tightly couples the
> IndexReader to the IndexWriter.

The integration with IndexWriter allows a reader to access segments
that haven't yet been committed to the index. This saves fsync()'ing
the written files, saves writing a new segments_N file, and saves
flushing deletes to disk and then reloading them (we just share the
BitVector directly in RAM now). On many OSes/filesystems fsync is
surprisingly costly.

LUCENE-1313, the next step for NRT, further reduces NRT reopen latency
by allowing the small segments to remain in RAM, so when reopening
your NRT reader after smallish adds/deletes no IO is incurred.

Beyond LUCENE-1313 we've discussed making IndexWriter's RAM buffer
directly searchable, so you don't pay the cost of pinching a new
segment when an NRT reader is reopened.

Really we only need to further improve the approach here if the
existing performance proves inadequate... in my limited testing the
performance was excellent.

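To make the per-segment point above concrete, here's a rough sketch of
why reopen gets cheap once caches are keyed per segment. This is plain
Java, not Lucene's FieldCache API: the class, its methods, and the
placeholder "load" step are all made up for illustration, assuming only
that segments are immutable once written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a per-segment cache; NOT Lucene's FieldCache. */
class PerSegmentCacheSketch {
    // One entry per segment name. Segments never change after they are
    // written, so an entry stays valid for as long as its segment exists.
    private final Map<String, int[]> cache = new HashMap<>();
    private int loads = 0; // how many segments were actually computed

    /** Placeholder for the expensive step (e.g. un-inverting a field). */
    private int[] load(String segment) {
        loads++;
        return new int[] { segment.length() }; // dummy values
    }

    /** "Reopen": reuse entries for unchanged segments, load only new ones. */
    List<int[]> reopen(List<String> segments) {
        cache.keySet().retainAll(segments); // drop merged-away segments
        List<int[]> view = new ArrayList<>();
        for (String s : segments) {
            view.add(cache.computeIfAbsent(s, this::load));
        }
        return view;
    }

    int loads() { return loads; }

    public static void main(String[] args) {
        PerSegmentCacheSketch c = new PerSegmentCacheSketch();
        c.reopen(Arrays.asList("_0", "_1"));
        System.out.println(c.loads()); // prints 2
        c.reopen(Arrays.asList("_0", "_1", "_2")); // one new small segment
        System.out.println(c.loads()); // prints 3
    }
}
```

With a single top-level cache entry, the second reopen would have
recomputed everything; here it only pays for the one new segment.
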
At this point, though, our inability to prioritize IO and control the
OS's IO cache from Java likely hurts NRT performance far more than
further improvements in our impl would help. I'd love to see a
Directory impl that "emulates" IO prioritization by making merging IO
wait whenever search IO is live. I think we need a JNI extension that
taps into madvise/posix_fadvise, when possible.

> FWIW, I still don't like the coupling of the two. I think it would
> be better if IW allowed you to get a Directory (or some other
> appropriate representation) representing the in memory segment that
> can then easily be added to an existing Searcher/Reader. This would
> at least decouple the two and instead use the common data structure
> they both already share, i.e. the Directory. Whether this is doable
> or not, I am not sure.

I agree the coupling is overkill.

But Directory is too low-level... we could probably get by with a
class that holds the SegmentReader cache (currently
IndexWriter.ReaderPool) and the "current" segmentInfos. IW would
interact with this class to get the readers it needs for applying
deletes and merging, as well as to post newly flushed but not yet
committed segments. IR would then pull from this class to get the
latest segments in the index and to check out the readers.

Such a shared "per-segment state" class could also be the basis for
app-specific custom caches to update themselves when new segments are
created, old ones are merged, etc.

Probably this class should break out SR's core separately. Hmm.

Mike
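
The idea above of a Directory that makes merging IO wait whenever
search IO is live could look roughly like the gate below. This is only
a sketch in plain Java: SearchFirstIOGate and its methods are
hypothetical names, nothing like it exists in Lucene, and a real
Directory wrapper would have to call it around every read and write.

```java
/** Hypothetical gate: merge IO yields whenever any search IO is live. */
class SearchFirstIOGate {
    private int liveSearchIO = 0; // search reads currently in flight

    /** Call before a search-side read. */
    synchronized void searchIOStarted() {
        liveSearchIO++;
    }

    /** Call after a search-side read; wakes merges once no search is live. */
    synchronized void searchIOFinished() {
        if (--liveSearchIO == 0) {
            notifyAll();
        }
    }

    /**
     * Call from merge threads before each chunk of IO. Blocks while any
     * search IO is live. A merge chunk that is already running is not
     * preempted, so chunks should be kept small.
     */
    synchronized void awaitMergeTurn() throws InterruptedException {
        while (liveSearchIO > 0) {
            wait();
        }
    }

    /** Non-blocking check, mainly useful for tests and logging. */
    synchronized boolean mergeMayProceed() {
        return liveSearchIO == 0;
    }

    public static void main(String[] args) throws InterruptedException {
        SearchFirstIOGate gate = new SearchFirstIOGate();
        gate.searchIOStarted();
        System.out.println(gate.mergeMayProceed()); // prints false
        gate.searchIOFinished();
        gate.awaitMergeTurn(); // returns immediately, no search IO is live
        System.out.println(gate.mergeMayProceed()); // prints true
    }
}
```

This only orders IO issued by one JVM; it does nothing about the OS's
IO cache, which is where something like madvise/posix_fadvise via JNI
would still be needed.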
