Re: Removing duplicate entries

João Rodrigues Tue, 29 Apr 2008 18:51:52 -0700

First of all, let me apologize for the double post but I got some strange
error message =\


>The first question is what do you mean the document
>is already in the index? Lucene doc IDs are useless
>here since the ones in your FSDir and the ones in your
>RAMdir are unrelated. In fact, I suspect that the
>lucene docIDs will start at the same number in both.

The documents have 2 fields, one of them being their identifier, which is
Indexed and Not Tokenized (and stored).

>Lucene doc IDs are just monotonically incremented integers.
>So, how do you identify identical documents? Is there some
>field in your document that's guaranteed to be unique to each
>document? If so, *that's* the field you can use for termEnu
>to get the Lucene docid to remove, assuming you've indexed
>it UN_TOKENIZED or you are very, very, very confident that
>your tokenizers won't break it up.
>But you can make this easier by using IndexReader.deleteDocument(term)
>where the term is your unique field.

Since I have an unique field for each document, I'll use that then. However,
what's the thinking behind the enum/delete? The termEnum lists me the terms
which have a particular field value and gives me their "lucene ID" so that
the Reader can then delete it?

>Additionally, I question why you bother with a RAMdir for your changes.
>An index reader essentially takes a snapshot of your index, and
>subsequent changes are not seen by your searchers until you
>close and reopen the underlying reader. What advantage do you
>see in using a RAMdir?

I use a RAMdir because it is somehow faster. I'm downloading a few thousand
documents at a time from the internet and then indexing them in a RAM index
and then merging them to the FS dir. Also, in case anything fails, my FS
directory, and thus my "stable" index, is safe from harm :) But why did you
ask? Advice is welcome :)

Thanks,

João Rodrigues

Re: Removing duplicate entries

Reply via email to