I'll probably end up using a filtered IndexSearcher, but let me take a step back and explain what I'm trying to do, since it relates to a lot of recent development in trunk (this probably belongs on java-user now).
We use Lucene in combination with MySQL to store data in a legacy homegrown CMS. All data is stored as key/value pairs, much like Lucene's Documents, and all querying is done through Lucene. One requirement that much of the CMS has been built on is that the search index provide database-like write/query consistency (read: as soon as an item is added, updated, or deleted, that is reflected in queries). This is obviously much improved with NRT search.

My current strategy is very similar to what Jason has going on in LUCENE-1313. I've got a disk index, a RAM index, and 0..n indexes in between waiting to be flushed to disk (indexes are pushed onto the flush queue when they hit some predefined size -- 5 MB). This is all very pretty and straightforward on the whiteboard, but a number of subtleties pop up during implementation.

Each time the index is updated, the current IndexReader is marked as needing replacement. The next time a reader is needed (for a search, generally), I have to gather up all of the indexes in a correct state and get current readers for them. Since we're going for consistency, this (currently) means blocking out writes while I'm creating a reader (there may be a more sophisticated and efficient way to go about this). I return a MultiReader that includes an IndexReader for the disk index, IW.getReader() for the current RAM index, and IW.getReader() for all queued RAM indexes (including one, if any, currently being flushed to disk). There's a rough sketch of this composition after my questions below.

I was originally using an IndexWriter and IW.getReader() for the disk index as well, but to keep my consistency I had to block reader creation while writing an index out to disk. Since IW doesn't support adding and deleting a set of documents atomically, it would be possible for a reader to call IW.getReader() while I'm in the middle of adding and removing documents from the disk writer. I could possibly use addIndexesNoOptimize, but the bit about it possibly requiring 2x index space scared me away. So now I'm using a plain IndexReader for the disk.

Like 1313, each time something is updated or removed from the primary RAM index, I have to suppress the same content in all of the queued RAM indexes and the disk index. For the small RAM indexes, I'm doing this with IndexWriter.deleteDocuments + IW.getReader(). For the disk index, I'm keeping a BitVector of filtered docs and searching for docs in the current disk IndexReader that match the provided updated/deleted Terms (second sketch below). This is why I was looking for a filterable IndexReader. Implementing it this way lets me write RAM indexes out to disk without blocking readers, and only block readers when I need to remap any filtered docs that may have been updated or deleted during the flushing process. I think this may beat using a straight IW for my requirements, but I'm not positive yet.

So I've currently got a SuppressedIndexReader that extends FilterIndexReader, but due to LUCENE-1483 and LUCENE-1573 I had to implement IndexReader.getFieldCacheKey() to get any sort of decent search performance, which I'd rather not do since I'm aware it's only temporary.

So, I have a couple of questions. Is it possible to perform a bunch of adds and deletes from an IW as an atomic action? Should I use addIndexesNoOptimize? If I go the filtered searcher direction, my filter will have to be aware of the portion of the MultiReader that corresponds to the disk index. Can I assume that my disk index will populate the lower portion of doc id space if it comes first in the list passed to the MultiReader constructor? The code says yes, but the docs don't say anything.
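To make the reader composition concrete, here's roughly what that step looks like. It's only a sketch: the field names (diskDir, currentRamWriter, queuedRamWriters) are placeholders, imports and the write-blocking synchronization are left out, and IW.getReader() is the near-real-time reader from trunk.

// Placeholder fields -- the names are made up for this sketch.
Directory diskDir;                    // the on-disk index
IndexWriter currentRamWriter;         // the RAM index currently receiving writes
List<IndexWriter> queuedRamWriters;   // RAM indexes queued for flushing, oldest first

IndexReader openCompositeReader() throws IOException {
  List<IndexReader> subs = new ArrayList<IndexReader>();
  // Disk index first, so it occupies the low end of the MultiReader's doc id space
  // (this ordering is the assumption behind my last question above).
  subs.add(IndexReader.open(diskDir, true));
  // Queued RAM indexes, including one (if any) currently being flushed to disk.
  for (IndexWriter queued : queuedRamWriters) {
    subs.add(queued.getReader());
  }
  // The live RAM index.
  subs.add(currentRamWriter.getReader());
  return new MultiReader(subs.toArray(new IndexReader[subs.size()]));
}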
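And here's a rough sketch of the disk-index suppression. I actually keep a BitVector, but I've used a plain java.util.BitSet here to keep the sketch self-contained; imports are omitted, and a real implementation would also need to give termDocs(Term) and termPositions() the same treatment as termDocs().

class SuppressedIndexReader extends FilterIndexReader {
  private final BitSet suppressed;  // docs to hide from searches

  SuppressedIndexReader(IndexReader in, BitSet suppressed) {
    super(in);
    this.suppressed = suppressed;
  }

  public boolean isDeleted(int docId) {
    return suppressed.get(docId) || in.isDeleted(docId);
  }

  public boolean hasDeletions() {
    return true;
  }

  public int numDocs() {
    return in.numDocs() - suppressed.cardinality();
  }

  public TermDocs termDocs() throws IOException {
    // Skip suppressed docs so scorers never see them.
    return new FilterTermDocs(in.termDocs()) {
      public boolean next() throws IOException {
        while (in.next()) {
          if (!suppressed.get(in.doc())) {
            return true;
          }
        }
        return false;
      }
    };
  }
}

// Building the suppressed set: mark every disk doc matching an updated/deleted Term.
BitSet findSuppressed(IndexReader diskReader, Collection<Term> changedTerms) throws IOException {
  BitSet suppressed = new BitSet(diskReader.maxDoc());
  for (Term term : changedTerms) {
    TermDocs td = diskReader.termDocs(term);
    try {
      while (td.next()) {
        suppressed.set(td.doc());
      }
    } finally {
      td.close();
    }
  }
  return suppressed;
}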
If you've followed any of what I've said and have some suggestions/comments, they'd be much appreciated.

-Jeremy

On Thu, Apr 9, 2009 at 8:01 PM, Michael McCandless <[email protected]> wrote:

> On Thu, Apr 9, 2009 at 7:02 PM, Jeremy Volkman <[email protected]> wrote:
>
> > I'm sure I can extend my wrapping reader to also wrap whatever is
> > returned by getSequentialSubReaders, however all of what I'm writing
> > is already done by IndexReader with respect to deletions. What if,
> > instead of throwing UnsupportedOperationExceptions, a read-only
> > IndexReader did everything it normally does with deletes up to the
> > point of actually writing the .del file? This would allow documents
> > to be removed from the reader for the lifetime of the reader, and
> > seems like it might be a minimal change.
>
> Well... readOnly IR relies on its deletedDocs never being changed, to
> allow isDeleted to be unsynchronized.
>
> Is this only for searching? Could you just use a Filter with your search?
>
> Or... you could make a silly FSDirectory extension that pretends to
> write outputs but never does, and pass it to IR.open?
>
> Or maybe we should open up a way to discard pending changes in an IR
> (like IW.rollback).
>
> Or, with near real-time search (in trunk) you could 1) open IW with
> autoCommit=false, 2) make your pretend deletes, 3) get a near
> real-time reader from the IW (IW.getReader()), 4) do stuff with that
> reader, 5) call IW.rollback() to discard your changes when done, and
> close the reader.
>
> One drawback with using deletes "temporarily" (as your filter) is you
> won't be able to do any real deletes.
>
> Mike
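P.S. In case it helps anyone following the thread, my reading of the rollback-based approach Mike describes above is roughly the following. This is only a sketch -- dir, analyzer, and the id term are placeholders -- and it doesn't fit my case since I also need to do real deletes:

// Temporary ("pretend") deletes that are never committed.
IndexWriter writer = new IndexWriter(dir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
try {
  writer.deleteDocuments(new Term("id", "42"));   // pretend delete
  IndexReader reader = writer.getReader();        // NRT reader already sees the delete
  try {
    IndexSearcher searcher = new IndexSearcher(reader);
    // ... run searches that shouldn't see the "deleted" docs ...
    searcher.close();
  } finally {
    reader.close();
  }
} finally {
  writer.rollback();   // discard the pretend deletes (this also closes the writer)
}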
