Chris,

Mark Miller & Co. are working on (Near) Duplicate Detection. I think the work is in Solr's JIRA, but some of it might be applicable to Lucene.
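For exact duplicates such as repeated phrases, the general idea behind signature-based detection can be sketched roughly as below. This is only an illustration, not code from the Solr patch: the class name PhraseSignatures and its methods are hypothetical, and it assumes that two phrases count as duplicates when they match after trimming, collapsing whitespace and lower-casing. It keeps one 64-bit fingerprint per phrase instead of the phrase itself, so it is still a separate data structure, just a much smaller one than a map of full strings. Two different phrases could in principle share a fingerprint, although that is very unlikely with 64 bits.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Solr.
class PhraseSignatures {
    // Fingerprints of every phrase accepted so far.
    private final Set<Long> seen = new HashSet<>();

    // Assumed notion of "same phrase": ignore case and extra whitespace.
    static String normalize(String phrase) {
        return phrase.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // First 8 bytes of the MD5 digest of the normalized phrase.
    static long signature(String phrase) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalize(phrase).getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Returns true the first time a phrase is seen, false for a repeat.
    boolean markIfNew(String phrase) {
        return seen.add(signature(phrase));
    }
}
```

If you would rather let the index do the work, the same signature can be stored as an untokenized key field and each document added with IndexWriter.updateDocument(Term, Document), which deletes any earlier document containing that term before adding the new one.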

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Chris Lu <chris...@gmail.com>
> To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> Sent: Monday, December 29, 2008 4:55:14 AM
> Subject: duplication checking while indexing
>
> I am wondering whether there is an easy way to avoid duplicates while
> indexing, using only the index being created, without creating other data
> structures.
>
> In some cases, the incoming document list can contain duplicates, for
> example when creating spell-checking indexes for phrases. Each phrase is one
> document, so I want to check whether a phrase has already been indexed.
>
> One option is to create a hash map of all the indexed phrases, but the hash
> map would consume a lot of memory.
> A possible alternative is to search the existing index. But remember that the
> index is still being created, and not all of its contents have been flushed
> to disk yet.
>
> Is it possible to query the not-yet-closed index?
>
> --
> Chris Lu