duplication checking while indexing

Chris Lu Mon, 29 Dec 2008 01:55:51 -0800

I am wondering whether there is an easy way to avoid duplication while
indexing, just using the index being created, without creating other data
structures.
In some cases, the incoming document list can have duplicates. For example,
when creating spell checking indexes for phrases. Each phrase is one
document. So I want to check whether the phrase is already indexed or not.


To do so, I can either create a hash map for all the indexed phrases. But
the hash map would consume a lot of memory.
A possible alternative is to search existing index. But remember the index
is being created, and not all contents are flushed to disk yet.

Is it possible to query the not-yet-closed index?

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

duplication checking while indexing

Reply via email to