I am wondering whether there is an easy way to avoid duplicates while indexing, using only the index being created, without creating other data structures. In some cases the incoming document list can contain duplicates, for example when creating spell-checking indexes for phrases, where each phrase is one document. So I want to check whether a phrase is already indexed before adding it.
To do so, I could keep a hash map of all the indexed phrases, but the hash map would consume a lot of memory. A possible alternative is to search the existing index. However, the index is still being created, and not all of its contents have been flushed to disk yet. Is it possible to query the not-yet-closed index?

-- Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!
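
P.S. For concreteness, here is a minimal sketch of the in-memory approach mentioned above, the one I would like to avoid because of its memory cost. It uses only the Java standard library; the class name PhraseDeduper and its methods are hypothetical names made up for illustration and are not part of Lucene. The actual call that adds a document to the index is left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: remembers every phrase already passed to the indexer.
class PhraseDeduper {
    // One entry per distinct phrase; this set is the memory cost in question.
    private final Set<String> seen = new HashSet<>();

    // Returns true if the phrase has not been seen before (and records it).
    boolean markIfNew(String phrase) {
        // Set.add returns false when the element is already present.
        return seen.add(phrase);
    }

    // Filters an incoming list down to phrases not seen so far, keeping order.
    List<String> uniquePhrases(List<String> incoming) {
        List<String> result = new ArrayList<>();
        for (String phrase : incoming) {
            if (markIfNew(phrase)) {
                result.add(phrase); // the caller would index this phrase
            }
        }
        return result;
    }
}
```

Only the phrases returned by uniquePhrases would be added to the index. The set grows with the number of distinct phrases, which is why I am asking whether the index itself can answer the "already indexed?" question instead.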