Hello Herb. Thank you very much for your reply. I want the cosine for
each a and each b. I'm using Lucene code that I found online, which I will
post below.

Hello Uwe. Thank you very much for replying. I am using a class DocVector and
a second class in which I try to compute the similarities between documents
that were indexed in two folders. Here is the code for the two classes.
Could you please help me? What am I doing wrong? Thank you very much!
package NewApp;

import extractout.*;
import java.util.Map;
import org.apache.commons.math3.linear.OpenMapRealVector;
import org.apache.commons.math3.linear.RealVectorFormat;
import org.apache.commons.math3.linear.SparseRealVector;

/**
 *
 * @author Stefy
 */
class DocVector {

    public Map<String,Integer> terms;
    public SparseRealVector vector;

    public DocVector(Map<String,Integer> terms) {
        this.terms = terms;
        this.vector = new OpenMapRealVector(terms.size());
    }

    public void setEntry(String term, int freq) {
        if (terms.containsKey(term)) {
            int pos = terms.get(term);
            vector.setEntry(pos, (double) freq);
        }
    }

    public void normalize() {
        double sum = vector.getL1Norm();
        vector = (SparseRealVector) vector.mapDivide(sum);
    }

    @Override
    public String toString() {
        RealVectorFormat formatter = new RealVectorFormat();
        return formatter.format(vector);
    }
}
---------------------------------------------------------------------------------------
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class testCosine {

    static String in_B = "/local/march_exp/in_B";
    static String data_B = "/local/march_exp/B_split100_EN";
    static String in_A = "/local/march_exp/in_A";
    static String data_A = "/local/march_exp/A_split100_EN";
    static File indexDir_B, dataDir_B, indexDir_A, dataDir_A;
    static IndexReader reader_A, reader_B;
    static Directory dir_B, dir_A;
    static int size_B = 23992, size_A = 10995;

    private static double getCosineSimilarity(DocVector d1, DocVector d2) {
        return (d1.vector.dotProduct(d2.vector))
                / (d1.vector.getNorm() * d2.vector.getNorm());
    }

    public static void testSimilarityUsingCosine() throws Exception {
        indexDir_A = new File(in_A);
        dir_A = FSDirectory.open(indexDir_A);
        reader_A = IndexReader.open(dir_A);

        indexDir_B = new File(in_B);
        dir_B = FSDirectory.open(indexDir_B);
        reader_B = IndexReader.open(dir_B);

        Map<String, Integer> terms_A = new HashMap<String, Integer>();
        TermEnum termEnum_A = reader_A.terms(new Term("contents"));
        Map<String, Integer> terms_B = new HashMap<String, Integer>();
        TermEnum termEnum_B = reader_B.terms(new Term("contents"));

        int pos = 0;
        while (termEnum_A.next()) {
            Term term = termEnum_A.term();
            if (!"contents".equals(term.field())) {
                break;
            }
            terms_A.put(term.text(), pos++);
        }

        pos = 0;
        while (termEnum_B.next()) {
            Term term = termEnum_B.term();
            if (!"contents".equals(term.field())) {
                break;
            }
            terms_B.put(term.text(), pos++);
        }

        int[] docIds_A = new int[size_A];
        DocVector[] docs_A = new DocVector[docIds_A.length];
        int i = 0;
        for (int docId : docIds_A) {
            TermFreqVector[] tfvs = reader_A.getTermFreqVectors(docId);
            docs_A[i] = new DocVector(terms_A);
            for (TermFreqVector tfv : tfvs) {
                String[] termTexts = tfv.getTerms();
                int[] termFreqs = tfv.getTermFrequencies();
                for (int j = 0; j < termTexts.length; j++) {
                    docs_A[i].setEntry(termTexts[j], termFreqs[j]);
                }
            }
            docs_A[i].normalize();
            i++;
        }

        int[] docIds_B = new int[size_B];
        DocVector[] docs_B = new DocVector[docIds_B.length];
        i = 0;
        for (int docId : docIds_B) {
            TermFreqVector[] tfvs = reader_B.getTermFreqVectors(docId);
            docs_B[i] = new DocVector(terms_B);
            for (TermFreqVector tfv : tfvs) {
                String[] termTexts = tfv.getTerms();
                int[] termFreqs = tfv.getTermFrequencies();
                for (int j = 0; j < termTexts.length; j++) {
                    docs_B[i].setEntry(termTexts[j], termFreqs[j]);
                }
            }
            docs_B[i].normalize();
        }

        FileWriter fstream_c = new FileWriter("/local/march_exp/COS/COSINE_.txt");
        BufferedWriter writer_c = new BufferedWriter(fstream_c);
        double[][] cosimvect = new double[size_A][size_B];
        for (i = 0; i < size_A; i++) {
            for (int j = 0; j < size_B; j++) {
                cosimvect[i][j] = getCosineSimilarity(docs_A[i], docs_B[j]);
                System.out.println("cosine between " + i + " " + j + " is "
                        + cosimvect[i][j]);
            }
        }

        writer_c.close();
        reader_B.close();
        reader_A.close();
        dir_B.close();
        dir_A.close();
    }

    public static void main(String[] args) throws Exception {
        testSimilarityUsingCosine();
    }
}
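
A note on the exception quoted further down: in the code above, each DocVector
is sized from its own term map (terms_A for index A, terms_B for index B), so
vectors from the two indexes will generally have different dimensions. That is
consistent with the reported "44,375 != 596,263". The following is only a
minimal sketch of one possible fix, assuming a single vocabulary built from
the terms of both indexes. It uses plain Java arrays rather than commons-math
or Lucene, and the class and method names (SharedVocabularyCosine,
buildVocabulary, toVector, cosine) are hypothetical, not taken from the
original code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Lucene or commons-math.
class SharedVocabularyCosine {

    // Assign one position per distinct term across BOTH collections,
    // so that every document vector ends up with the same dimension.
    static Map<String, Integer> buildVocabulary(Iterable<String> termsA,
                                                Iterable<String> termsB) {
        TreeSet<String> all = new TreeSet<>();
        for (String t : termsA) all.add(t);
        for (String t : termsB) all.add(t);
        Map<String, Integer> positions = new HashMap<>();
        int pos = 0;
        for (String t : all) positions.put(t, pos++);
        return positions;
    }

    // Turn a term -> frequency map into a dense vector over the shared vocabulary.
    static double[] toVector(Map<String, Integer> termFreqs,
                             Map<String, Integer> vocabulary) {
        double[] v = new double[vocabulary.size()];
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer pos = vocabulary.get(e.getKey());
            if (pos != null) v[pos] = e.getValue();
        }
        return v;
    }

    // Cosine similarity; returns 0.0 if either vector has zero length.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(a.length + " != " + b.length);
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy term frequencies standing in for one document from each index.
        Map<String, Integer> docA = new HashMap<>();
        docA.put("lucene", 2);
        docA.put("index", 1);
        Map<String, Integer> docB = new HashMap<>();
        docB.put("lucene", 1);
        docB.put("vector", 3);

        Map<String, Integer> vocab = buildVocabulary(docA.keySet(), docB.keySet());
        double[] va = toVector(docA, vocab);
        double[] vb = toVector(docB, vocab);
        System.out.println("cosine = " + cosine(va, vb));
    }
}
```

Applied to the posted code, this would mean filling one term map from both
termEnum_A and termEnum_B and passing that same map to every DocVector. Two
other details in the posted code may also be worth checking: docIds_A and
docIds_B are created with new int[size] and never filled, so every loop
iteration reads document 0, and the loop over docIds_B never increments i.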
On Friday, March 21, 2014 12:14 AM, Uwe Schindler <u...@thetaphi.de> wrote:
Hi Stefy,
the stack trace you posted has nothing to do with Apache Lucene. It looks like
you are using some commons-math3 classes here, but no Lucene code at all. So I
think your question might be better asked on the commons-math mailing list,
unless you have some Lucene code around, too. If this is the case, you should
give more information about how you use Lucene.
Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
-----Original Message-----
From: Stefy D. [mailto:tsuki_st...@yahoo.com]
Sent: Thursday, March 20, 2014 10:05 PM
To: java-user@lucene.apache.org
Subject: Dimension mismatch exception
Dear all,
I am trying to compute the cosine similarity between several documents. I
have an indexed directory A made using 10000 files and another indexed
directory B made using 20000 files. All the indexed documents from both
directories have the same length (100 sentences). I want to get the cosine
similarity between documents from directory A and documents from
directory B. I have used the code from here but on the two indexed
directories. So I use something like getCosineSimilarity(docs_A[i], docs_B[j]);
I get the following error:
Exception in thread "main" org.apache.commons.math3.exception.DimensionMismatchException: 44,375 != 596,263
    at org.apache.commons.math3.linear.RealVector.checkVectorDimensions(RealVector.java:179)
    at org.apache.commons.math3.linear.RealVector.checkVectorDimensions(RealVector.java:165)
    at org.apache.commons.math3.linear.RealVector.dotProduct(RealVector.java:307)
    at NewApp.testCosine.getCosineSimilarity(testCosine.java:57)
Please help me. Thank you very much!