On May 28, 2014, at 12:03 AM, Michael O'Leary <mich...@moz.com> wrote:

> Hi Andi,
> Thanks for the help. I just tried to import TVTermsEnum so I could try
> casting my iter, and I don't see how to do it since TVTermsEnum is a
> private class with fully qualified name
> org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.
> I tried
Cast the TermsEnum object with BytesRefIterator.cast_. Then it will have a next method, and be python-iterable. Here’s an example that outputs the term vectors as a generator. Look at the vector method just above:
https://pythonhosted.org/lupyne/_modules/lupyne/engine/indexers.html#IndexReader.termvector

> from org.apache.lucene.codecs.compressing import CompressingTermVectorsReader$TVTermsEnum
> from org.apache.lucene.codecs.compressing import TVTermsEnum
> and
> import org.apache.lucene.codecs.compressing
>
> but none of them provided access to TVTermsEnum (the first two raised
> exceptions). After running import org.apache.lucene.codecs.compressing, I
> could do dir(org.apache.lucene.codecs.compressing) and see the contents of
> that module. CompressingTermVectorsReader was listed, but TVTermsEnum
> wasn't. TVTermsEnum also wasn't listed in the output of
> dir(org.apache.lucene.codecs.compressing.CompressingTermVectorsReader). So
> it looks like my first problem is how to get access to TVTermsEnum.
> Mike
>
>
> On Tue, May 27, 2014 at 11:10 PM, Andi Vajda <va...@apache.org> wrote:
>
>>
>>> On May 27, 2014, at 19:17, "Michael O'Leary" <mich...@moz.com> wrote:
>>>
>>> *tl;dnr*: a next() method is defined for the Java class TVTermsEnum in
>>> Lucene 4.8.1, but it looks like there is no next() method available for an
>>> object that looks like it is an instance of the Python class TVTermsEnum in
>>> PyLucene 4.8.1.
>>
>> If there is a next() method, there is a good chance the object is even
>> iterable (in the python sense). You may need to cast it first, though, as
>> the api that returned it to you may not be defined to return TVTermsEnum:
>> TVTermsEnum.cast_(obj)
>>
>> A good place for PyLucene code examples is its suite of unit tests. It
>> also has a few samples - way less than in 3.x releases because the APIs
>> changed too much.
>> I'm pretty sure there is a test involving TermsEnum in the tests directory.
>>
>> Andi..
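The cast suggested above can be sketched roughly as follows. This is only an illustration, not the lupyne implementation: the helper name term_frequencies is made up here, and it assumes PyLucene 4.x is installed, that lucene.initVM() has already been called, and that the field was indexed with term vectors. It relies on the public BytesRefIterator interface rather than the private TVTermsEnum class.

```python
def term_frequencies(reader, doc_id, field):
    """Yield (term, frequency) pairs from one document's term vector.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this module loads without a JVM; PyLucene must be
    # installed and lucene.initVM() already called before this function runs.
    from org.apache.lucene.util import BytesRefIterator

    terms = reader.getTermVector(doc_id, field)
    if terms is None:
        # No term vector was stored for this document and field.
        return
    terms_enum = terms.iterator(None)  # Lucene 4.x signature: iterator(reuse)
    # Casting to the public interface is what makes the object iterable from
    # Python, as described in the reply above.
    for bytes_ref in BytesRefIterator.cast_(terms_enum):
        # For a term vector, totalTermFreq() is the count within this document.
        yield bytes_ref.utf8ToString(), int(terms_enum.totalTermFreq())
```

With the names from the original question, dict(term_frequencies(reader, doc_id_1, "cat_ids")) would then give a term-to-count mapping for that document.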
>>
>>> I have a set of documents that I would like to cluster. These documents
>>> share a vocabulary of only about 3,000 unique terms, but there are about
>>> 15,000,000 documents. One way I thought of doing this would be to index the
>>> documents using PyLucene (Python is the preferred programming language at
>>> work), obtain term vectors for the documents using PyLucene API functions,
>>> and calculate cosine similarities between pairs of term vectors in order to
>>> determine which documents are close to each other.
>>>
>>> I found some sample Java code on the web that various people have posted
>>> showing ways to do this with older versions of Lucene. I downloaded
>>> PyLucene 4.8.1 and compared its API functions with the ones used in the
>>> code samples, and saw that this is an area of Lucene that has changed quite
>>> a bit. I can send an email to the lucene-user mailing group to ask what
>>> would be a good way of doing this using version 4.8.1, but the question I
>>> have for this mailing group has to do with some Java API functions that it
>>> looks like are not exposed in Python, unless I have to go about accessing
>>> them in a different way.
>>>
>>> If I obtain the term vector for the field "cat_ids" in a document with id
>>> doc_id_1
>>>
>>> doc_1_tfv = reader.getTermVector(doc_id_1, "cat_ids")
>>>
>>> then doc_1_tfv is displayed as this object:
>>>
>>> <Terms: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTerms@32c46396>
>>>
>>> In some of the sample code I looked at, the terms in doc_1_tfv could be
>>> obtained with doc_1_tfv.getTerms(), but it looks like getTerms is not a
>>> member function of Terms or its subclasses any more. In another code
>>> sample, an iterator for the term vector is obtained via tfv_iter =
>>> doc_1_tfv.iterator(None) and then the terms are obtained one by one with
>>> calls to tfv_iter.next(). This is where I get stuck.
>>> tfv_iter has this value:
>>>
>>> <TermsEnum: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum@1cca2369>
>>>
>>> and there is a next() function defined for the TVTermsEnum class, but this
>>> object doesn't list next() as one of its member functions and an exception
>>> is raised if it is called. It looks like the object only supports the
>>> member functions defined for the TermsEnum class, and next() is not one of
>>> them. Is this the case, or is there a way to have it support all of the
>>> TVTermsEnum member functions, including next()? TVTermsEnum is a private
>>> class in CompressingTermVectorsReader.java.
>>>
>>> So I am wondering if there is a way to obtain term vectors in this way and
>>> that I am just not treating doc_1_tfv and tfv_iter in the right way, or if
>>> there is a different, better way to get term vectors for documents in a
>>> PyLucene index, or if this isn't something that Lucene should be used for.
>>> Thank you very much for any help you can provide.
>>> Mike
>>
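For the cosine similarity step mentioned in the original question, here is a minimal pure-Python sketch. It assumes each term vector has already been pulled out of the index into a dict mapping term to frequency; the function name and the sample vectors are made up for illustration and use no PyLucene API.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse term-frequency vectors (term -> count)."""
    # Iterate over the smaller vector so the dot product touches fewer terms.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    dot = sum(freq * vec_b.get(term, 0) for term, freq in vec_a.items())
    norm_a = math.sqrt(sum(freq * freq for freq in vec_a.values()))
    norm_b = math.sqrt(sum(freq * freq for freq in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical term vectors for two documents' "cat_ids" fields.
doc_1 = {"12": 3, "40": 4}
doc_2 = {"12": 3}
similarity = cosine_similarity(doc_1, doc_2)
```

Note that comparing every pair among roughly 15,000,000 documents means on the order of 10^14 similarity computations, so this function is probably only a building block and some way of narrowing the candidate pairs would be needed first.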