Re: similarity matrix - more clear

Chris Hostetter Tue, 30 Nov 2004 14:48:16 -0800

: A possible solution would be to initialize in turn each document as a
: query, do a search using an IndexSearcher and to take from the search
: result the similarity between the query (which is in fact a document)
: and all the other documents. This is highly redundant, because the
: similarity between a pair of documents is computed multiple times.


A simpler approach that I can think of would be to iterate over a complete
TermEnum of the index and, for each Term, get the corresponding TermDocs
enumerator to list every document that contains that term.  Assuming that
every pair of docs initially has a similarity of "0", this would allow you
to increment the similarity of each pair every time you find a term that
both docs have in common.  (The amount you increment the score for
each pair could be based on TermEnum.docFreq() and TermDocs.freq().)

A very simple approach might be something like...

   // needs org.apache.lucene.index.* and java.util.HashMap / java.util.Map
   IndexReader r = ...;
   int[][] scores = new int[r.maxDoc()][r.maxDoc()];
   TermEnum enumerator = r.terms();
   TermDocs termDocs = r.termDocs();
   // terms() starts out positioned before the first term, so next()
   // has to be called before term() returns anything useful
   while (enumerator.next()) {
      Term term = enumerator.term();
      if (term == null) {
         break;
      }
      // collect doc id -> term frequency for every doc containing the term
      termDocs.seek(term);
      Map<Integer,Integer> docs = new HashMap<Integer,Integer>();
      while (termDocs.next()) {
         docs.put(termDocs.doc(), termDocs.freq());
      }
      for (Integer ii : docs.keySet()) {
         for (Integer jj : docs.keySet()) {
            if (ii < jj) {
               continue; // do each pair only once
            }
            // note: ii == jj also lands here, so the diagonal gets filled
            scores[jj][ii] += (docs.get(ii) + docs.get(jj)) / 2;
         }
      }
   }
   termDocs.close();
   enumerator.close();
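
If you want to play with the weighting before touching a real index, here is a rough, Lucene-free sketch of the same accumulation. The class name TermOverlapSketch, the in-memory map standing in for the index (term -> doc id -> frequency), and the particular weight (the smaller of the two frequencies, scaled by log(maxDoc / docFreq) + 1) are all assumptions of mine for illustration, not anything Lucene does for you. With a real index the two inputs would come from TermEnum.docFreq() and TermDocs.freq().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the index: term -> (doc id -> term frequency).
final class TermOverlapSketch {

    // Returns an upper-triangular matrix: scores[i][j] is only filled for i < j.
    static double[][] similarity(Map<String, Map<Integer, Integer>> postings, int maxDoc) {
        double[][] scores = new double[maxDoc][maxDoc];
        for (Map<Integer, Integer> docs : postings.values()) {
            int docFreq = docs.size();
            if (docFreq < 2) {
                continue; // a term found in a single doc links no pair
            }
            // Assumed weighting: terms found in fewer docs count for more.
            double idf = Math.log((double) maxDoc / docFreq) + 1.0;
            List<Integer> ids = new ArrayList<>(docs.keySet());
            for (int a = 0; a < ids.size(); a++) {
                for (int b = a + 1; b < ids.size(); b++) {
                    // order the pair so each one is stored exactly once
                    int i = Math.min(ids.get(a), ids.get(b));
                    int j = Math.max(ids.get(a), ids.get(b));
                    int shared = Math.min(docs.get(i), docs.get(j));
                    scores[i][j] += shared * idf;
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        postings.put("lucene", new HashMap<>(Map.of(0, 2, 1, 4, 2, 1)));
        postings.put("matrix", new HashMap<>(Map.of(0, 1, 2, 3)));
        double[][] scores = similarity(postings, 3);
        for (int i = 0; i < scores.length; i++) {
            for (int j = i + 1; j < scores.length; j++) {
                System.out.println(i + " ~ " + j + " = " + scores[i][j]);
            }
        }
    }
}
```

Walking the index list in the inner loops (b starting at a + 1) skips both the duplicate pairs and the diagonal, so no "continue" check is needed. The matrix is still maxDoc x maxDoc, which is the part that will hurt on a large index.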

