Hi,

Thanks, Shai and Mike, for your suggestions. I went with Shai's second approach. However, I'm now running into the following problem:
After deleting a document from the index, I also delete it from a copy of the directory that contained the original documents. I expected that afterwards neither the directory nor the index would contain the document. To check this, I take each document in the updated directory, convert it to a query, send that query to the index via IndexSearcher, and examine the hits for each document. For some reason, the hits include a document which I had deleted from the index (via IndexReader). Is there a valid explanation for this? How can I be sure that the index will not contain that document?

Here's the code snippet I am experimenting with (hopefully it is self-explanatory):

    System.out.println("Documents which are in the whitelist : " + docsEncounteredNames.toString());
    IndexReader reader = IndexReader.open(indexDir);
    for (int doc_itr = 0; doc_itr < reader.maxDoc(); doc_itr++) {
        if (docsEncountered.contains(doc_itr)) {
            // skip if I encountered this document
            continue;
        } else if (!reader.isDeleted(doc_itr)) {
            System.out.println("Deleting document with name: " + reader.document(doc_itr).get("filename"));
            File docToDelete = new File(orgDocsDir + "/" + reader.document(doc_itr).get("filename"));
            reader.deleteDocument(doc_itr);
            System.out.println("Also deleting original document " + docToDelete.getCanonicalPath());
            docToDelete.delete();
        }
    }

Best,
Anuj

On Thu, Jul 23, 2009 at 6:24 AM, Michael McCandless<luc...@mikemccandless.com> wrote:
> I think you could also delete by Query (using IndexWriter), concocting
> a single large query that's something like MatchAllDocsQuery AND NOT
> (Q1 OR Q2 OR Q3...) where Q1, Q2, Q3 are the queries that identify the
> docs you want to keep.
>
> Mike
>
> On Wed, Jul 22, 2009 at 10:58 PM, Anuj Bhatt<anuj.bh...@gmail.com> wrote:
>> Hi,
>>
>> I'm relatively new to Lucene. I have the following case: I have
>> indexed a bunch of documents. I then query the index using
>> IndexSearcher and retrieve the documents using Hits (I do know this is
>> deprecated -- I'm using v 2.4.1). So, I do this for a set of queries
>> and maintain which documents are returned to each one. In the end of
>> it all, I have a list of documents maintained (more specifically, the
>> hits.id(some_iterator_int) associated with the doc). Now, I wish to
>> delete the documents which have not been returned for any query, from
>> the index. How can I do this?
>>
>> My initial assumption was that I could retrieve all the doc ids from
>> IndexReader and just traverse the list that I have maintained, if it
>> is in the list, I don't delete it otherwise I do. Looking around
>> didn't yield anything, and hence the mail.
>>
>> Any suggestions?
>>
>> Regards,
>> Anuj
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
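
For reference, the selection step described in the quoted question (walk every doc id below maxDoc and keep only the ones returned by some query) can be sketched in plain Java, so it runs without Lucene on the classpath. This is only an illustration under stated assumptions: the class name DeletionPlan, the method idsToDelete, and the alreadyDeleted predicate are hypothetical names, not Lucene API. In real code the predicate would presumably wrap reader.isDeleted(id), and each returned id would be passed to reader.deleteDocument(id).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

// Hypothetical helper, not part of Lucene: decides which doc ids to delete.
class DeletionPlan {

    // Returns every id in [0, maxDoc) that is neither in 'keep' nor already deleted.
    static List<Integer> idsToDelete(int maxDoc, Set<Integer> keep, IntPredicate alreadyDeleted) {
        List<Integer> toDelete = new ArrayList<>();
        for (int id = 0; id < maxDoc; id++) {
            if (keep.contains(id) || alreadyDeleted.test(id)) {
                continue; // whitelisted or already gone: nothing to do
            }
            toDelete.add(id);
        }
        return toDelete;
    }

    public static void main(String[] args) {
        // Assumed sample data: ids 0 and 2 were returned by some query, id 3 is already deleted.
        Set<Integer> keep = new HashSet<>(Arrays.asList(0, 2));
        List<Integer> plan = idsToDelete(5, keep, id -> id == 3);
        System.out.println(plan); // prints [1, 4]
    }
}
```

Separating the decision from the deletion makes it possible to print or inspect the planned ids before touching the index or the files on disk.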