On Tue, Jul 13, 2010 at 5:17 PM, Max Lynch <ihas...@gmail.com> wrote:
> Hi,
> I would like to continuously iterate over the documents in my lucene index
> as the index is updated. Kind of like a "stream" of documents. Is there a
> way I can achieve this?
>
> Would something like this be sufficient (untested):
>
> int currentDocId = 0;
> while(true) {
>
>     for(; currentDocId < reader.maxDoc(); currentDocId++) {
>
>         if(!reader.isDeleted(currentDocId)) {
>             Document d = reader.document(currentDocId);
>         }
>     }
>
>     // Maybe sleep here or something
>
>     IndexReader newReader = reader.reopen();
>     if(newReader != reader) {
>         reader.close();
>         reader = newReader;
>     }
> }

Looks ok.

> Right now, I do some NLP on the index that would slow down my indexing if
> done at the same time, so that is why I'm looking for a solution that works
> in the background like this. Another concern I have is that starting from
> scratch (fresh invocation of my program) requires me to load a lot of extra
> data and then iterate through hundreds of thousands of documents just to get
> to the newest docs that I haven't processed yet. I would rather just start
> from the newest doc and go forward.
>
> I am currently checking whether or not I've processed a Document by looking
> up a field in the Document in a Mongo db, but is there a way I could
> reliably use the id of the document from the reader to check to see if I've
> looked at this document already? I've heard that IndexReader.document() is
> slow so I would like to skip that call if I know I've processed the document
> already.

You could add a field to each doc, say "Processed", and store a value of
Yes/No in it. Then run a searcher query on that field, which should give you
the collection of unprocessed ones.
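
To make the "Processed" flag idea concrete, here is a rough sketch that uses a plain in-memory list as a stand-in for the index. None of it is Lucene API: the class, the method names and the "key" field are all hypothetical and only model the flow (query for flag = No, process, flip the flag). In a real index, flipping the flag would presumably mean re-indexing the document, since a Lucene update is a delete followed by an add. Note also that the reader's internal doc ids can change when segments are merged, so a stable key field is likely a safer thing to record than the doc id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for an index: each "document" is a map of field name -> value.
// This models the idea only; none of it is Lucene API.
class ProcessedFlagSketch {

    static final String FLAG = "Processed"; // field name from the suggestion above

    // The index stand-in; insertion order plays the role of arrival order.
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Index a new document; it starts out unprocessed.
    void add(String key, String body) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("key", key);   // hypothetical stable application-level id
        doc.put("body", body);
        doc.put(FLAG, "No");
        docs.add(doc);
    }

    // Stand-in for the "searcher query": every doc whose flag is still "No".
    List<Map<String, String>> findUnprocessed() {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if ("No".equals(doc.get(FLAG))) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // One background pass: handle every pending doc, flip its flag,
    // and return how many were handled.
    int processPending() {
        int handled = 0;
        for (Map<String, String> doc : findUnprocessed()) {
            // ... the slow NLP step would go here ...
            doc.put(FLAG, "Yes");
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        ProcessedFlagSketch index = new ProcessedFlagSketch();
        index.add("a", "first doc");
        index.add("b", "second doc");
        System.out.println(index.processPending()); // both docs are new
        index.add("c", "third doc");
        System.out.println(index.processPending()); // only the new arrival
    }
}
```

The point of the design is that a fresh start does not have to walk the whole index: each pass touches only the docs still flagged "No", however many were processed earlier.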