: A possible solution would be to initialize in turn each document as a
: query, do a search using an IndexSearcher and to take from the search
: result the similarity between the query (which is in fact a document)
: and all the other documents. This is highly redundant, because the
: similarity between a pair of documents is computed multiple times.

A simpler approach that I can think of would be to iterate over a
complete TermEnum of the index and, for each Term, get the corresponding
TermDocs enumerator to list every document that contains that term.
Assuming that every pair of docs initially has a similarity of 0, this
lets you increment the similarity of each pair every time you find a
term that both docs have in common. (The amount you increment the score
for each pair could be based on TermEnum.docFreq() and TermDocs.freq().)

A very simple approach might be something like...

   import java.util.HashMap;
   import java.util.Map;
   import org.apache.lucene.index.*;

   IndexReader r = ...;
   int[][] scores = new int[r.maxDoc()][r.maxDoc()];
   TermEnum enumerator = r.terms();
   TermDocs termDocs = r.termDocs();
   // next() must be called before term() returns the first term
   while (enumerator.next()) {
     termDocs.seek(enumerator.term());
     // doc id -> frequency of the current term in that doc
     Map<Integer,Integer> docs = new HashMap<Integer,Integer>();
     while (termDocs.next()) {
       docs.put(termDocs.doc(), termDocs.freq());
     }
     for (int ii : docs.keySet()) {
       for (int jj : docs.keySet()) {
         if (ii <= jj) {
           continue; // do each pair only once, and skip self-pairs
         }
         // jj < ii here, so only cells with row < column are filled
         scores[jj][ii] += (docs.get(ii) + docs.get(jj)) / 2;
       }
     }
   }
   termDocs.close();
   enumerator.close();
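To make the docFreq()/freq() idea a bit more concrete, here is a rough sketch that runs without an index at all. Everything in it is made up for illustration: the class name PairwiseSimilaritySketch, the score() method, and the in-memory postings map (term -> doc id -> frequency) that stands in for what TermEnum and TermDocs would give you. The weighting (a log-based bonus for rare terms times the smaller of the two frequencies) is only one plausible choice and is not Lucene's own scoring formula.

```java
import java.util.HashMap;
import java.util.Map;

public class PairwiseSimilaritySketch {

    /**
     * postings maps each term to (doc id -> frequency of the term in that doc).
     * Returns a matrix where only cells [a][b] with a < b are filled in.
     */
    public static double[][] score(Map<String, Map<Integer, Integer>> postings,
                                   int maxDoc) {
        double[][] scores = new double[maxDoc][maxDoc];
        for (Map<Integer, Integer> docs : postings.values()) {
            int docFreq = docs.size(); // plays the role of TermEnum.docFreq()
            if (docFreq < 2) {
                continue; // a term in a single doc is shared by no pair
            }
            // Assumed weighting: terms found in fewer docs count for more.
            double weight = Math.log((double) maxDoc / docFreq) + 1.0;
            for (Map.Entry<Integer, Integer> a : docs.entrySet()) {
                for (Map.Entry<Integer, Integer> b : docs.entrySet()) {
                    if (a.getKey() >= b.getKey()) {
                        continue; // do each pair only once, skip self-pairs
                    }
                    // The values play the role of TermDocs.freq().
                    int shared = Math.min(a.getValue(), b.getValue());
                    scores[a.getKey()][b.getKey()] += weight * shared;
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        postings.put("lucene", Map.of(0, 2, 1, 1));       // in docs 0 and 1
        postings.put("index", Map.of(0, 1, 1, 3, 2, 1));  // in docs 0, 1 and 2
        double[][] s = score(postings, 3);
        System.out.println("0-1: " + s[0][1]);
        System.out.println("0-2: " + s[0][2]);
        System.out.println("1-2: " + s[1][2]);
    }
}
```

In the toy data, docs 0 and 1 share both terms, so they should come out with a higher score than either does with doc 2. Against a real index you would fill the inner map from TermDocs for each term instead of building it by hand.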