what you are doing below is iterating over every term in your index and, for each term that appears in more than one doc, fetching every matching document with IndexReader.document (which is a really bad idea in general in a loop like this)

your original problem description was " 'Find Duplicate records for Column X' then I would filter the table to show only records where there is more than one with the same value i.e duplicate for that column." ... if we equate columns with fields, then what you are doing is certainly overkill -- you can request a TermEnum that starts at a particular field (or use the full TermEnum and "skipTo" the first possible term in your field) and then stop iterating when you get to a new field.

you can also get a lot better performance when compiling the list of ROW_NUMBER field values if you use a FieldCache instead of fetching each document.

: Date: Mon, 26 Feb 2007 16:25:11 +0000
: From: Paul Taylor <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org, [EMAIL PROTECTED]
: To: Erick Erickson <[EMAIL PROTECTED]>
: Cc: java-user@lucene.apache.org
: Subject: Re: Can I use Lucene to retrieve a list of duplicates
:
: Hi
:
: I got it working before I saw your latest mail, the only problem is
: that it doesn't look very efficient. This is my duplicate method, the
: problem is that I have to enumerate through *every* term. This was worse
: before because I was only interested
: in terms that matched a particular field (column) but had enumerate
: through every term whatever field it was part of, so I recreated my
: index so that each document only contained a row number field, and a
: second field for the value of the column, however this means I am going
: to end up with a number of different indexes each solving a particular
: problem.
:
: paul
:
: public List<Integer> getDuplicates()
: {
:     List<Integer> matches = new ArrayList<Integer>();
:     try
:     {
:         IndexReader ir = IndexReader.open(directory);
:         TermEnum terms = ir.terms();
:         while (terms.next())
:         {
:             if (terms.docFreq() > 1)
:             {
:                 TermDocs termDocs = ir.termDocs(terms.term());
:                 while (termDocs.next())
:                 {
:                     Document d = ir.document(termDocs.doc());
:                     matches.add(new Integer(d.getField(ROW_NUMBER).stringValue()));
:                 }
:             }
:         }
:
:     }
:     catch (IOException ioe)
:     {
:         ioe.printStackTrace();
:     }
:     return matches;
: }
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:


-Hoss
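to make the suggestion at the top of this message a bit more concrete, here is a rough sketch of the shape of the loop. it is plain Java that only models the idea and is not Lucene code: a TreeMap stands in for the sorted term dictionary, an int[] per term stands in for TermDocs, and a doc-id-indexed int[] plays the role a FieldCache array would play. DuplicateScan, FieldTerm, duplicateRows and the sample data are all made up for illustration. in Lucene itself the seek would presumably be IndexReader.terms(new Term(field, "")) and the array would come from something like FieldCache.DEFAULT.getInts(reader, ROW_NUMBER), but check those against the javadocs for your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-Java model of the idea, NOT Lucene code. All names here are hypothetical.
class DuplicateScan {

    // Stand-in for a Lucene Term: sorts by field name first, then by term text,
    // so all terms of one field sit next to each other in the sorted map.
    static final class FieldTerm implements Comparable<FieldTerm> {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(FieldTerm other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    // postings:   term -> ids of the docs containing it (stand-in for TermDocs)
    // rowNumbers: doc id -> ROW_NUMBER value, loaded once up front
    //             (the role a FieldCache int[] would play)
    static List<Integer> duplicateRows(SortedMap<FieldTerm, int[]> postings,
                                       int[] rowNumbers, String field) {
        List<Integer> matches = new ArrayList<>();
        // Seek straight to the first possible term of the field ...
        SortedMap<FieldTerm, int[]> fromField =
                postings.tailMap(new FieldTerm(field, ""));
        for (Map.Entry<FieldTerm, int[]> entry : fromField.entrySet()) {
            // ... and stop as soon as the enumeration reaches another field.
            if (!entry.getKey().field.equals(field)) {
                break;
            }
            int[] docs = entry.getValue();
            if (docs.length > 1) {                 // same test as docFreq() > 1
                for (int doc : docs) {
                    matches.add(rowNumbers[doc]);  // array lookup, no document fetch
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedMap<FieldTerm, int[]> postings = new TreeMap<>();
        postings.put(new FieldTerm("colA", "z"), new int[] {0, 1});
        postings.put(new FieldTerm("colX", "apple"), new int[] {0, 2});
        postings.put(new FieldTerm("colX", "pear"), new int[] {1});
        postings.put(new FieldTerm("colX", "plum"), new int[] {3});
        postings.put(new FieldTerm("colY", "red"), new int[] {0, 1, 2, 3});
        int[] rowNumbers = {10, 11, 12, 13};

        System.out.println(duplicateRows(postings, rowNumbers, "colX")); // prints [10, 12]
    }
}
```

the point of the model is that terms from colA and colY are never examined for their doc lists when you ask about colX, and that the row number comes from an array lookup instead of loading a stored document per hit. with that in place a single index can serve every column, so there should be no need for one index per problem.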