Hey Harald,

On Mon, Aug 6, 2012 at 1:22 PM, Harald Kirsch <harald.kir...@raytion.com> wrote:
> Hi,
>
> in my application I have to write tons of small documents to the index, but
> with a twist. Many of the documents are actually aggregations of pieces of
> information that appear in a data stream, usually close together, but
> nevertheless merged with information for other documents.
>
> When information a1 for my document A arrives, I create my A-object, store
> it with index.addDocument() and forget about it. Later, when a2 arrives, I
> fetch A from the index, delete it from the index, update it, and store its
> updated version. To fetch it from the index, I use a reader retrieved with
> IndexReader.openIfChanged(). So for one piece of information I have roughly
> the following sequence:
>
>   get searcher via IndexReader.openIfChanged()
>   find previously stored document, if any
>   if document already available {
>     update document object
>     index.deleteDocument(new Term(IDFIELD, id))
>   } else {
>     create document object
>   }
>   index.addDocument()
>
> The overall speed is not too bad, but I wonder if more is possible. I
> changed RAMBufferSizeMB from the default 16 to 200 but saw no improvement in
> speed.
>
> I would think that keeping documents in RAM for some time such that many
> updates happen in RAM, rather then being written to disk would improve the
> overall running time.
>
> Any hints how to configure and use Lucene to improve the speed without
> layering my own caching on top of it?

What happens is this: when you re-open a reader from an IW (near-real-time),
you flush documents to disk each time you reopen the NRT reader. With a high
update rate that likely means you don't keep documents in memory for very
long, so increasing the RAM buffer size won't help much.

What I would try to exploit is the fact that you only need to open a new
reader if the document (or its latest update) you are looking for has not
been flushed to disk yet, i.e. is not in the reader you already have open.

Lucene ships with some handy tools that help you implement this. I'd likely
use org.apache.lucene.search.NRTManager, which exposes the methods of IW
(update/add/delete) and returns a sequence ID that you can later use to
request an NRT reader. Let's say you have document X indexed with sequence
ID 15 and you now want to update it. You look up the ID of doc X in a hash
map or something similar to get its last-changed sequence ID, then you ask
the NRTManager to refresh the searcher it currently holds with
NRTManager#waitForGeneration(15). If that generation is already refreshed,
the call returns immediately; otherwise it waits until the new searcher is
opened. Then you can just acquire a searcher and check the document.
Something like this:

    String id = doc.getId();
    Long seqId = mapping.get(id);
    if (seqId != null) {
      // wait until the last change to this document is searchable
      nrtManager.waitForGeneration(seqId);
    }
    IndexSearcher s = nrtManager.acquire();
    try {
      IndexReader reader = s.getIndexReader();
      // do something
    } finally {
      // always hand the searcher back, even if the lookup fails
      nrtManager.release(s);
    }

From time to time you can prune the mapping of sequence IDs that are
already flushed.

Hope that helps,

simon

>
> Harald.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
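
The id-to-sequence-ID mapping and its occasional pruning, as described in the reply above, could be sketched roughly as follows. This is only an illustration that uses plain JDK classes (Java 8 or later) and no Lucene calls. The class name GenerationMap and its methods are hypothetical, and the caller is assumed to pass in a generation it already knows to be searchable, for example one it has previously waited for.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: remembers the sequence ID of the latest change per document id. */
class GenerationMap {
    private final Map<String, Long> lastGen = new ConcurrentHashMap<>();

    /** Record the sequence ID returned for a change to this id; the highest value wins. */
    void record(String id, long generation) {
        lastGen.merge(id, generation, Long::max);
    }

    /** Sequence ID to wait for before looking up this id, or null if no wait is needed. */
    Long lastGeneration(String id) {
        return lastGen.get(id);
    }

    /** Drop entries at or below the given generation; returns how many were removed. */
    int prune(long visibleGeneration) {
        int before = lastGen.size();
        lastGen.values().removeIf(gen -> gen <= visibleGeneration);
        return before - lastGen.size();
    }

    /** Number of ids that still have a pending sequence ID. */
    int size() {
        return lastGen.size();
    }
}
```

A missing entry then simply means that whatever searcher is current already shows the document, so the wait can be skipped, which matches the null check in the snippet from the reply.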