Hi,

Thanks, Shai and Mike, for your suggestions. I went with Shai's second
approach. However, I have now run into this:

After deleting a document from the index, I also delete it from a
copy of the directory that contains the original documents. I expected
that afterwards neither the directory nor the index would contain the
document. To check this, I take each document in the updated directory
and convert it to a query. I then send this query to the index via
IndexSearcher and examine the hits. For some reason, the hits include a
document that I had deleted from the index (via IndexReader). Is
there a valid explanation for this? How can I be sure that the
index no longer contains that document? Here's the code snippet I am
experimenting with (hopefully it is self-explanatory):


        System.out.println("Documents which are in the whitelist : " + docsEncounteredNames.toString());
        IndexReader reader = IndexReader.open(indexDir);

        for (int doc_itr = 0; doc_itr < reader.maxDoc(); doc_itr++)
        {
                if (docsEncountered.contains(doc_itr))
                {
                        // skip if I encountered this document
                        continue;
                }
                else if (!reader.isDeleted(doc_itr))
                {
                        System.out.println("Deleting document with name: " + reader.document(doc_itr).get("filename"));
                        File docToDelete = new File(orgDocsDir + "/" + reader.document(doc_itr).get("filename"));
                        reader.deleteDocument(doc_itr);
                        System.out.println("Also deleting original document " + docToDelete.getCanonicalPath());
                        docToDelete.delete();
                }
        }
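
In case it helps to check the selection step separately from Lucene, here is a minimal sketch of the same logic using plain collections. The class DeletionPlan, the method idsToDelete and its parameters are hypothetical names of mine, not Lucene API; the two sets stand in for docsEncountered and for whatever reader.isDeleted() would report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: it only models the selection
// step of the loop above using plain collections.
class DeletionPlan {

    // Returns every doc id in [0, maxDoc) that is neither in the whitelist
    // nor already marked as deleted, in ascending order.
    static List<Integer> idsToDelete(int maxDoc, Set<Integer> whitelist, Set<Integer> alreadyDeleted) {
        List<Integer> result = new ArrayList<Integer>();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (whitelist.contains(docId)) {
                continue; // encountered by some query, so keep it
            }
            if (alreadyDeleted.contains(docId)) {
                continue; // nothing left to delete for this id
            }
            result.add(docId);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> whitelist = new HashSet<Integer>(Arrays.asList(1, 3));
        Set<Integer> alreadyDeleted = new HashSet<Integer>(Arrays.asList(4));
        System.out.println(idsToDelete(5, whitelist, alreadyDeleted)); // prints [0, 2]
    }
}
```

With this, the ids picked for deletion can be compared against the filenames that still turn up in the hits, which should at least show whether the selection itself is at fault.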

Best,
Anuj


On Thu, Jul 23, 2009 at 6:24 AM, Michael McCandless<luc...@mikemccandless.com> wrote:
> I think you could also delete by Query (using IndexWriter), concocting
> a single large query that's something like MatchAllDocsQuery AND NOT
> (Q1 OR Q2 OR Q3...) where Q1, Q2, Q3 are the queries that identify the
> docs you want to keep.
>
> Mike
>
> On Wed, Jul 22, 2009 at 10:58 PM, Anuj Bhatt<anuj.bh...@gmail.com> wrote:
>> Hi,
>>
>> I'm relatively new to Lucene. I have the following case: I have
>> indexed a bunch of documents. I then query the index using
>> IndexSearcher and retrieve the documents using Hits (I do know this is
>> deprecated; I'm using v2.4.1). I do this for a set of queries
>> and keep track of which documents are returned for each one. At the end
>> of it all, I have a list of documents (more specifically, the
>> hits.id(some_iterator_int) associated with each doc). Now I wish to
>> delete from the index the documents that were not returned for any
>> query. How can I do this?
>>
>> My initial assumption was that I could retrieve all the doc ids from
>> IndexReader and traverse them: if an id is in the list that I have
>> maintained, I don't delete it; otherwise I do. Looking around
>> didn't yield anything, hence this mail.
>>
>>
>> Any suggestions?
>>
>>
>> Regards,
>> Anuj
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
