Computing the cosine between two documents requires that the two document vectors be the same length (same number of elements, i.e. the same dimensionality, not the same norm). The length of each vector is the size of the vocabulary of the whole set it was built from. Your two sets will inevitably have different numbers of terms in their vocabularies, which is why the exception reports 44,375 != 596,263. Also, if the sets are indexed independently, the words at each position in the two vectors are going to be different as well. The number of documents does not matter, nor does the number of sentences.
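
To make the point concrete, here is a minimal sketch in plain Java (no Lucene or commons-math calls) that builds one shared vocabulary for both sets, so every vector has the same length and the same term at each position. The names SharedVocabularyCosine, buildVocabulary, toVector and cosine are made up for illustration, and it assumes each document's term frequencies have already been read into a Map<String, Integer>.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class SharedVocabularyCosine {

    // Give every distinct term from ALL documents (both sets) one fixed position.
    static Map<String, Integer> buildVocabulary(List<Map<String, Integer>> allDocs) {
        TreeSet<String> sortedTerms = new TreeSet<>();
        for (Map<String, Integer> doc : allDocs) {
            sortedTerms.addAll(doc.keySet());
        }
        Map<String, Integer> positions = new HashMap<>();
        int pos = 0;
        for (String term : sortedTerms) {
            positions.put(term, pos++);
        }
        return positions;
    }

    // Build a dense vector whose length is the shared vocabulary size.
    static double[] toVector(Map<String, Integer> termFreqs, Map<String, Integer> vocabulary) {
        double[] v = new double[vocabulary.size()];
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer pos = vocabulary.get(e.getKey());
            if (pos != null) {
                v[pos] = e.getValue();
            }
        }
        return v;
    }

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch: " + a.length + " != " + b.length);
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // The cosine is undefined for an all-zero vector; this sketch returns 0 in that case.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With a vocabulary of several hundred thousand terms you would not want dense arrays for every document; a sparse vector such as the OpenMapRealVector already used in the code below, sized to the shared vocabulary, is the more practical choice.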

There are several ways to address this problem. One way is to index all of the documents into one index but keep track of which set is which. Then you can run the combinations and compute the cosines. Uwe knows more about how the term vectors are represented in Lucene. You may have to do some extra work to get them into a form that you can use to compute cosines.
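
One possible form for that extra work, as a rough sketch: if each document's term vector is read out into a map from term text to frequency, the cosine can be computed directly on the maps. Because the maps are keyed by the term itself rather than by a position, two documents cannot disagree about vector length or term order. The class name SparseCosine and the map representation are assumptions for illustration, not something Lucene provides.

```java
import java.util.Map;

final class SparseCosine {

    // Euclidean (L2) norm of a term-frequency map.
    static double norm(Map<String, Integer> termFreqs) {
        double sumOfSquares = 0.0;
        for (int freq : termFreqs.values()) {
            sumOfSquares += (double) freq * freq;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine between two documents given as term -> frequency maps.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        // Iterate over the smaller map; only terms present in both contribute to the dot product.
        Map<String, Integer> small = a.size() <= b.size() ? a : b;
        Map<String, Integer> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : small.entrySet()) {
            Integer other = large.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double denominator = norm(a) * norm(b);
        // An empty document has no direction; this sketch returns 0 for it.
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }
}
```

If you compare every document against many others, computing each document's norm once and caching it saves a good share of the work.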

That's a lot of combinations, by the way: 10,000 x 20,000 = 200 million comparisons. It's going to take a while.
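
To put rough numbers on that, here is a back-of-the-envelope sketch. The round sizes are the ones from the question; the class name ComparisonCost and the throughput figure are purely hypothetical placeholders, so measure your own rate before trusting any time estimate. It also shows what keeping every result in a double[][] would cost: 200 million values at 8 bytes each is about 1.6 GB, which argues for writing each cosine to the output file as it is computed.

```java
final class ComparisonCost {

    // Number of (a, b) pairs; cast to long before multiplying to avoid int overflow.
    static long pairCount(int sizeA, int sizeB) {
        return (long) sizeA * sizeB;
    }

    // Payload of a dense matrix of doubles only; array headers add a little more.
    static long denseMatrixBytes(int sizeA, int sizeB) {
        return pairCount(sizeA, sizeB) * Double.BYTES;
    }

    // Wall-clock estimate for a given throughput in comparisons per second.
    static double estimatedHours(long pairs, double pairsPerSecond) {
        return pairs / pairsPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        long pairs = pairCount(10_000, 20_000);
        System.out.println(pairs + " comparisons");
        System.out.println(denseMatrixBytes(10_000, 20_000) + " bytes to hold all results");

        // HYPOTHETICAL throughput, for illustration only; replace it with a measured value.
        double assumedPairsPerSecond = 5_000.0;
        System.out.println(estimatedHours(pairs, assumedPairsPerSecond) + " hours at the assumed rate");
    }
}
```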

See this page for some suggestions on how to do it: http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene

On 3/21/2014 1:50 AM, Stefy D. wrote:
Hello Herb. Thank you very much for your reply. I want to have the cosine for 
each document a in A and each document b in B. I'm using code for Lucene that 
I found online, which I will post below.

Hello Uwe. Thank you very much for replying. I am using a class DocVector and 
then a class in which I try to compute the similarities between documents that 
were indexed in two folders. Here is the code for the two classes.

Could you please help me? What am I doing wrong? Thank you very much!

package NewApp;

import extractout.*;
import java.util.Map;
import org.apache.commons.math3.linear.OpenMapRealVector;
import org.apache.commons.math3.linear.RealVectorFormat;
import org.apache.commons.math3.linear.SparseRealVector;

/**
  *
  * @author Stefy
  */
class DocVector {
    // Maps each term to its position in the vector.
    public Map<String,Integer> terms;
    public SparseRealVector vector;

    public DocVector(Map<String,Integer> terms) {
        this.terms = terms;
        // The vector dimension equals the size of the vocabulary passed in.
        this.vector = new OpenMapRealVector(terms.size());
    }

    public void setEntry(String term, int freq) {
        if (terms.containsKey(term)) {
            int pos = terms.get(term);
            vector.setEntry(pos, (double) freq);
        }
    }

    // Scales the vector so that its L1 norm is 1.
    public void normalize() {
        double sum = vector.getL1Norm();
        vector = (SparseRealVector) vector.mapDivide(sum);
    }

    @Override
    public String toString() {
        RealVectorFormat formatter = new RealVectorFormat();
        return formatter.format(vector);
    }
}

---------------------------------------------------------------------------------------
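package NewApp;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class testCosine {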

     static String in_B = "/local/march_exp/in_B";
     static String data_B = "/local/march_exp/B_split100_EN";
     static String in_A = "/local/march_exp/in_A";
     static String data_A = "/local/march_exp/A_split100_EN";
     static File indexDir_B, dataDir_B, indexDir_A, dataDir_A;
     static IndexReader reader_A, reader_B;
     static Directory dir_B, dir_A;
     static int size_B = 23992, size_A = 10995;

     private static double getCosineSimilarity(DocVector d1, DocVector d2) {
         return (d1.vector.dotProduct(d2.vector))
                 / (d1.vector.getNorm() * d2.vector.getNorm());
     }

     public static void testSimilarityUsingCosine() throws Exception {

         indexDir_A = new File(in_A);
         dir_A = FSDirectory.open(indexDir_A);
         reader_A = IndexReader.open(dir_A);

         indexDir_B = new File(in_B);
         dir_B = FSDirectory.open(indexDir_B);
         reader_B = IndexReader.open(dir_B);

         Map<String, Integer> terms_A = new HashMap<String, Integer>();
         TermEnum termEnum_A = reader_A.terms(new Term("contents"));
         Map<String, Integer> terms_B = new HashMap<String, Integer>();
         TermEnum termEnum_B = reader_B.terms(new Term("contents"));

         int pos = 0;
         while (termEnum_A.next()) {
             Term term = termEnum_A.term();
             if (!"contents".equals(term.field())) {
                 break;
             }
             terms_A.put(term.text(), pos++);
         }

         pos = 0;
         while (termEnum_B.next()) {
             Term term = termEnum_B.term();
             if (!"contents".equals(term.field())) {
                 break;
             }
             terms_B.put(term.text(), pos++);
         }


         int[] docIds_A = new int[size_A];
         DocVector[] docs_A = new DocVector[docIds_A.length];
         int i = 0;
         for (int docId : docIds_A) {
             TermFreqVector[] tfvs = reader_A.getTermFreqVectors(docId);
             docs_A[i] = new DocVector(terms_A);
             for (TermFreqVector tfv : tfvs) {
                 String[] termTexts = tfv.getTerms();
                 int[] termFreqs = tfv.getTermFrequencies();
                 for (int j = 0; j < termTexts.length; j++) {
                     docs_A[i].setEntry(termTexts[j], termFreqs[j]);
                 }
             }
             docs_A[i].normalize();
             i++;
         }

         int[] docIds_B = new int[size_B];
         DocVector[] docs_B = new DocVector[docIds_B.length];
         i = 0;
         for (int docId : docIds_B) {
             TermFreqVector[] tfvs = reader_B.getTermFreqVectors(docId);
             docs_B[i] = new DocVector(terms_B);
             for (TermFreqVector tfv : tfvs) {
                 String[] termTexts = tfv.getTerms();
                 int[] termFreqs = tfv.getTermFrequencies();
                 for (int j = 0; j < termTexts.length; j++) {
                     docs_B[i].setEntry(termTexts[j], termFreqs[j]);
                 }
             }
             docs_B[i].normalize();
             i++;
         }

         FileWriter fstream_c = new FileWriter("/local/march_exp/COS/COSINE_.txt");
         BufferedWriter writer_c = new BufferedWriter(fstream_c);

         double[][] cosimvect = new double[size_A][size_B];
         for (i = 0; i < size_A; i++) {
             for (int j = 0; j < size_B; j++) {
                 cosimvect[i][j] = getCosineSimilarity(docs_A[i], docs_B[j]);
                 System.out.println("cosine between " + i + " " + j + " is "
                         + cosimvect[i][j]);
             }
         }
         writer_c.close();
         reader_B.close();
         reader_A.close();
         dir_B.close();
         dir_A.close();
     }

     public static void main(String[] args) throws Exception {

         testSimilarityUsingCosine();
     }
}




On Friday, March 21, 2014 12:14 AM, Uwe Schindler <u...@thetaphi.de> wrote:
Hi Stefy,

The stack trace you posted has nothing to do with Apache Lucene. It looks like 
you are using some commons-math3 classes here, but no Lucene code at all. So I 
think your question might be better asked on the commons-math mailing list, 
unless you have some Lucene code around, too. If that is the case, you should 
give more information about how you use Lucene.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-----Original Message-----
From: Stefy D. [mailto:tsuki_st...@yahoo.com]
Sent: Thursday, March 20, 2014 10:05 PM
To: java-user@lucene.apache.org
Subject: Dimension mismatch exception

Dear all,

I am trying to compute the cosine similarity between several documents. I
have an indexed directory A made using 10000 files and another indexed
directory B made using 20000 files. All the indexed documents from both
directories have the same length (100 sentences). I want to get the cosine
similarity between documents from directory A and documents from
directory B. I have used the code from here but on the two indexed
directories. So I use something like getCosineSimilarity(docs_A[i], docs_B[j]);

I get the following error:
Exception in thread "main" org.apache.commons.math3.exception.DimensionMismatchException: 44,375 != 596,263
      at org.apache.commons.math3.linear.RealVector.checkVectorDimensions(RealVector.java:179)
      at org.apache.commons.math3.linear.RealVector.checkVectorDimensions(RealVector.java:165)
      at org.apache.commons.math3.linear.RealVector.dotProduct(RealVector.java:307)
      at NewApp.testCosine.getCosineSimilarity(testCosine.java:57)

Please help me. Thank you very much!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

