First, you'll probably want to search the user list archive for this
issue, as it's been discussed before and you'll find more information
there than I can remember off the top of my head. That said:
1> Changes to an index are not visible to a searcher until you reopen
the underlying reader, and you probably have to flush (commit) the
writer before reopening. Doing this for every document will be
costly.
2> How do you identify duplicates? If it's a short enough signature,
you could consider keeping an in-memory set of signatures and
checking it while indexing. If you later need to add to an
existing index, you could simply use TermEnum/TermDocs to read
all the existing values into memory before adding to it.
3> You could consider using some kind of calculated signature of
the whole file for your key, but that may not suit your app.
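
Here's a rough sketch of 2> and 3> combined, using only the JDK. The
class name DuplicateFilter and its methods are made up for
illustration, they are not Lucene API, and SHA-256 over the whole
content is just one possible signature. It may not suit your notion
of "duplicate".

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not part of Lucene): remembers the signatures of
// everything indexed so far, so no search is needed during indexing.
class DuplicateFilter {
    private final Set<String> seen = new HashSet<>();

    // Assumed signature: hex-encoded SHA-256 of the whole content.
    static String signature(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Load signatures already stored in an existing index
    // (e.g. values read back from a signature field).
    void preload(Iterable<String> signatures) {
        for (String s : signatures) {
            seen.add(s);
        }
    }

    // Returns true the first time this content is seen, false for repeats.
    boolean markIfNew(byte[] content) {
        return seen.add(signature(content));
    }
}
```

You'd call markIfNew(bytes) before each addDocument and skip the
document when it returns false. If you also store the signature in
its own untokenized field, you can read the values back with
TermEnum and pass them to preload() before adding to an existing
index.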
Best
Erick
On Sat, Mar 7, 2009 at 12:21 AM, sonfon wrote:
> Dear All,
> I'm considering building an index for my application with Lucene.
> However, the document sources I'm going to index contain many duplicates,
> so before adding a document to an IndexWriter, I'd like to search the index
> first to see whether a copy of the same document has already been added. I
> used an IndexSearcher to search the same Directory while the IndexWriter was
> writing to it. However, it seems that the IndexSearcher returns no results,
> even though I'm sure duplicate copies have already been indexed. After the
> indexing procedure finishes, I do get the search results, so I'm sure my
> code isn't wrong.
> Could anyone offer some help? Some example code would be appreciated.
> Best wishes.