Otis, thanks for the pointer. I think the question can be restated as: how can TermEnum or TermInfos be accessed during indexing?
If this is possible, things would be easier.

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!

On Mon, Dec 29, 2008 at 10:41 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Chris,
>
> Mark Miller & Co. are working on (Near) Duplicate Detection. I think the
> work is in Solr's JIRA, but some of it might be applicable to Lucene.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Chris Lu <chris...@gmail.com>
> > To: "java-user@lucene.apache.org" <java-user@lucene.apache.org>
> > Sent: Monday, December 29, 2008 4:55:14 AM
> > Subject: duplication checking while indexing
> >
> > I am wondering whether there is an easy way to avoid duplication while
> > indexing, just using the index being created, without creating other data
> > structures.
> >
> > In some cases, the incoming document list can have duplicates. For example,
> > when creating spell checking indexes for phrases. Each phrase is one
> > document. So I want to check whether the phrase is already indexed or not.
> >
> > To do so, I can either create a hash map for all the indexed phrases. But
> > the hash map would consume a lot of memory.
> >
> > A possible alternative is to search existing index. But remember the index
> > is being created, and not all contents are flushed to disk yet.
> >
> > Is it possible to query the not-yet-closed index?
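For reference, the hash-map option mentioned in the original question can be sketched roughly as below. This is only a minimal illustration under assumptions: `PhraseDeduplicator`, `addIfNew` and `firstOccurrences` are hypothetical names, not part of Lucene, phrases are compared by exact string equality, and the step that actually adds a document to the index is left as a comment. It also keeps every phrase in memory, which is exactly the cost the question is trying to avoid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a Lucene class): remembers every phrase seen so far.
final class PhraseDeduplicator {
    // Holds all phrases already accepted; memory grows with the number of unique phrases.
    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time a phrase is offered, false for exact repeats. */
    boolean addIfNew(String phrase) {
        return seen.add(phrase);
    }

    /** Keeps only the first occurrence of each phrase, preserving input order. */
    List<String> firstOccurrences(List<String> incoming) {
        List<String> unique = new ArrayList<>();
        for (String phrase : incoming) {
            if (addIfNew(phrase)) {
                // Assumption: this is where the caller would build the
                // one-phrase document and hand it to the index writer.
                unique.add(phrase);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        PhraseDeduplicator dedup = new PhraseDeduplicator();
        List<String> incoming = Arrays.asList("new york", "san jose", "new york");
        System.out.println(dedup.firstOccurrences(incoming));
    }
}
```

It may also be worth checking whether IndexWriter.updateDocument(Term, Document) fits this case: it deletes any documents containing the given term before adding the new one, so it might avoid the extra data structure, assuming each phrase is also indexed as a single un-analyzed term.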