Re: Getting term vectors/computing cosine similarity

Aric Coady Wed, 28 May 2014 10:00:46 -0700

On May 28, 2014, at 12:03 AM, Michael O'Leary <mich...@moz.com> wrote:
> Hi Andi,
> Thanks for the help. I just tried to import TVTermsEnum so I could try
> casting my iter, and I don't see how to do it since TVTermsEnum is a
> private class with fully qualified
> name 
> org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.
> I tried


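You don't need access to TVTermsEnum itself.  Cast the TermsEnum object with 
BytesRefIterator.cast_; BytesRefIterator is the public interface that declares 
next().  The cast object then has a next method and is Python-iterable.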

Here’s an example that yields a document's term vector from a generator.  Also 
look at the vector method just above it in the same source file:
https://pythonhosted.org/lupyne/_modules/lupyne/engine/indexers.html#IndexReader.termvector
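
A minimal sketch of that cast-and-iterate pattern, assuming PyLucene 4.x, where
Terms.iterator takes a reuse argument and BytesRefIterator lives in
org.apache.lucene.util.  The helper name term_counts and its cast parameter are
made up for illustration; passing the cast in as an argument keeps the function
importable without a running JVM.

```python
def term_counts(terms, cast):
    """Return {term: frequency} for one document's term vector.

    terms: the Terms object returned by reader.getTermVector(doc_id, field),
           or None when the field has no stored term vector.
    cast:  callable that makes the TermsEnum Python-iterable; with PyLucene
           this is assumed to be BytesRefIterator.cast_.
    """
    if terms is None:
        return {}
    terms_enum = terms.iterator(None)  # Lucene 4.x signature: iterator(reuse)
    counts = {}
    for term in cast(terms_enum):  # each item is a BytesRef
        # The enum is positioned on the current term, so per-term statistics
        # are read from terms_enum rather than from the BytesRef.
        counts[term.utf8ToString()] = int(terms_enum.totalTermFreq())
    return counts
```

Against a live index this would be called roughly as
term_counts(reader.getTermVector(doc_id_1, "cat_ids"), BytesRefIterator.cast_)
after "from org.apache.lucene.util import BytesRefIterator".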

> from org.apache.lucene.codecs.compressing import
> CompressingTermVectorsReader$TVTermsEnum
> from org.apache.lucene.codecs.compressing import TVTermsEnum
> and
> import org.apache.lucene.codecs.compressing
> 
> but none of them provided access to TVTermsEnum (the first two raised
> exceptions). After running import org.apache.lucene.codecs.compressing, I
> could do dir(org.apache.lucene.codecs.compressing) and see the contents of
> that module. CompressingTermVectorsReader was listed, but TVTermsEnum
> wasn't. TVTermsEnum also wasn't listed in the output of
> dir(org.apache.lucene.codecs.compressing.CompressingTermVectorsReader). So
> it looks like my first problem is how to get access to TVTermsEnum.
> Mike
> 
> 
> On Tue, May 27, 2014 at 11:10 PM, Andi Vajda <va...@apache.org> wrote:
> 
>> 
>>> On May 27, 2014, at 19:17, "Michael O'Leary" <mich...@moz.com> wrote:
>>> 
>>> *tl;dnr*: a next() method is defined for the Java class TVTermsEnum in
>>> Lucene 4.8.1, but it looks like there is no next() method available for an
>>> object that looks like it is an instance of the Python class TVTermsEnum in
>>> PyLucene 4.8.1.
>> 
>> If there is a next() method, there is a good chance the object is even
>> iterable (in the python sense). You may need to cast it first, though, as
>> the api that returned it to you may not be defined to return TVTermsEnum:
>>  TVTermsEnum.cast_(obj)
>> 
>> A good place for PyLucene code examples is its suite of unit tests. It
>> also has a few samples - way less than in 3.x releases because the APIs
>> changed too much.
>> I'm pretty sure there is a test involving TermsEnum in the tests directory.
>> 
>> Andi..
>> 
>>> I have a set of documents that I would like to cluster. These documents
>>> share a vocabulary of only about 3,000 unique terms, but there are about
>>> 15,000,000 documents. One way I thought of doing this would be to index the
>>> documents using PyLucene (Python is the preferred programming language at
>>> work), obtain term vectors for the documents using PyLucene API functions,
>>> and calculate cosine similarities between pairs of term vectors in order to
>>> determine which documents are close to each other.
>>> 
>>> I found some sample Java code on the web that various people have posted
>>> showing ways to do this with older versions of Lucene. I downloaded
>>> PyLucene 4.8.1 and compared its API functions with the ones used in the
>>> code samples, and saw that this is an area of Lucene that has changed quite
>>> a bit. I can send an email to the lucene-user mailing group to ask what
>>> would be a good way of doing this using version 4.8.1, but the question I
>>> have for this mailing group has to do with some Java API functions that it
>>> looks like are not exposed in Python, unless I have to go about accessing
>>> them in a different way.
>>> 
>>> If I obtain the term vector for the field "cat_ids" in a document with id
>>> doc_id_1
>>> 
>>> doc_1_tfv = reader.getTermVector(doc_id_1, "cat_ids")
>>> 
>>> then doc_1_tfv is displayed as this object:
>>> 
>>> <Terms: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTerms@32c46396>
>>> 
>>> In some of the sample code I looked at, the terms in doc_1_tfv could be
>>> obtained with doc_1_tfv.getTerms(), but it looks like getTerms is not a
>>> member function of Terms or its subclasses any more. In another code
>>> sample, an iterator for the term vector is obtained via tfv_iter =
>>> doc_1_tfv.iterator(None) and then the terms are obtained one by one with
>>> calls to tfv_iter.next(). This is where I get stuck. tfv_iter has this
>>> value:
>>> 
>>> <TermsEnum: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum@1cca2369>
>>> 
>>> and there is a next() function defined for the TVTermsEnum class, but this
>>> object doesn't list next() as one of its member functions and an exception
>>> is raised if it is called. It looks like the object only supports the
>>> member functions defined for the TermsEnum class, and next() is not one of
>>> them. Is this the case, or is there a way to have it support all of the
>>> TVTermsEnum member functions, including next()? TVTermsEnum is a private
>>> class in CompressingTermVectorsReader.java.
>>> 
>>> So I am wondering if there is a way to obtain term vectors in this way and
>>> that I am just not treating doc_1_tfv and tfv_iter in the right way, or if
>>> there is a different, better way to get term vectors for documents in a
>>> PyLucene index, or if this isn't something that Lucene should be used for.
>>> Thank you very much for any help you can provide.
>>> Mike
>>
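
For the cosine-similarity step described in the original question, here is a
small pure-Python sketch.  It assumes each term vector has already been turned
into a dict mapping term to frequency; the function name cosine_similarity and
the dict representation are illustrative choices, not part of PyLucene.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors.

    a, b: dicts mapping term -> frequency; terms absent from a dict count as 0.
    Returns 0.0 when either vector is empty or all zeros.
    """
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict for the dot product
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Note that comparing every pair among 15,000,000 documents is on the order of
10^14 calls, so a function like this is probably only practical on a sample or
on candidate pairs chosen some other way.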
