Chris,

Mark Miller & Co. are working on (Near) Duplicate Detection.  I think the work 
is in Solr's JIRA, but some of it might be applicable to Lucene.

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Chris Lu <chris...@gmail.com>
> To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> Sent: Monday, December 29, 2008 4:55:14 AM
> Subject: duplication checking while indexing
> 
> I am wondering whether there is an easy way to avoid duplication while
> indexing, just using the index being created, without creating other data
> structures.
> In some cases, the incoming document list can have duplicates. For example,
> when creating spell checking indexes for phrases. Each phrase is one
> document. So I want to check whether the phrase is already indexed or not.
> 
> To do so, I can either create a hash map for all the indexed phrases. But
> the hash map would consume a lot of memory.
> A possible alternative is to search existing index. But remember the index
> is being created, and not all contents are flushed to disk yet.
> 
> Is it possible to query the not-yet-closed index?
> 
> -- 
> Chris Lu
> -------------------------
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes:
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per request) got
> 2.6 Million Euro funding!


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to