Getting term vectors/computing cosine similarity

Michael O'Leary Tue, 27 May 2014 22:18:07 -0700

*tl;dr*: a next() method is defined for the Java class TVTermsEnum in
Lucene 4.8.1, but the Python object that appears to wrap a TVTermsEnum
instance in PyLucene 4.8.1 does not seem to expose next().


I have a set of documents that I would like to cluster. These documents
share a vocabulary of only about 3,000 unique terms, but there are about
15,000,000 documents. One way I thought of doing this would be to index the
documents using PyLucene (Python is the preferred programming language at
work), obtain term vectors for the documents using PyLucene API functions,
and calculate cosine similarities between pairs of term vectors in order to
determine which documents are close to each other.
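
To make the last step concrete, here is a minimal sketch of the cosine
similarity calculation I have in mind. It assumes each term vector has
already been pulled out of the index into a plain Python dict mapping term
to frequency; the function name and the dict representation are my own
illustration and not part of the PyLucene API.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse term vectors (dict: term -> weight)."""
    # Iterate over the smaller vector so the dot product touches fewer terms.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a

    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())

    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))

    # Treat an empty (all-zero) vector as similar to nothing.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, cosine_similarity({"7": 2, "12": 1}, {"7": 1, "40": 3})
compares two documents that share only the term "7".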

I found some sample Java code on the web that various people have posted
showing ways to do this with older versions of Lucene. I downloaded
PyLucene 4.8.1 and compared its API functions with the ones used in the
code samples, and saw that this is an area of Lucene that has changed quite
a bit. I can ask on the lucene-user mailing list what would be a good way
of doing this with version 4.8.1, but my question for this list concerns
some Java API functions that do not appear to be exposed in Python, unless
I need to access them in a different way.

If I obtain the term vector for the field "cat_ids" in a document with id
doc_id_1

doc_1_tfv = reader.getTermVector(doc_id_1, "cat_ids")

then doc_1_tfv is displayed as this object:

<Terms: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTerms@32c46396>

In some of the sample code I looked at, the terms in doc_1_tfv could be
obtained with doc_1_tfv.getTerms(), but it looks like getTerms is not a
member function of Terms or its subclasses any more. In another code
sample, an iterator for the term vector is obtained via tfv_iter =
doc_1_tfv.iterator(None) and then the terms are obtained one by one with
calls to tfv_iter.next(). This is where I get stuck. tfv_iter has this
value:

<TermsEnum: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum@1cca2369>

and there is a next() function defined for the TVTermsEnum class, but this
object doesn't list next() as one of its member functions and an exception
is raised if it is called. It looks like the object only supports the
member functions defined for the TermsEnum class, and next() is not one of
them. Is this the case, or is there a way to have it support all of the
TVTermsEnum member functions, including next()? TVTermsEnum is a private
class in CompressingTermVectorsReader.java.

So I am wondering whether term vectors can be obtained this way and I am
just not handling doc_1_tfv and tfv_iter correctly, whether there is a
different, better way to get term vectors for documents in a PyLucene
index, or whether this isn't something that Lucene should be used for.
Thank you very much for any help you can provide.
Mike
