*tl;dr*: a next() method is defined for the Java class TVTermsEnum in Lucene 4.8.1, but no next() method appears to be available on an object that looks like an instance of the Python class TVTermsEnum in PyLucene 4.8.1.

I have a set of documents that I would like to cluster. These documents share a vocabulary of only about 3,000 unique terms, but there are about 15,000,000 documents. One way I thought of doing this would be to index the documents using PyLucene (Python is the preferred programming language at work), obtain term vectors for the documents using PyLucene API functions, and calculate cosine similarities between pairs of term vectors in order to determine which documents are close to each other.

I found some sample Java code on the web that various people have posted showing ways to do this with older versions of Lucene. I downloaded PyLucene 4.8.1 and compared its API functions with the ones used in the code samples, and saw that this is an area of Lucene that has changed quite a bit. I can send an email to the lucene-user mailing list to ask what would be a good way of doing this using version 4.8.1, but the question I have for this list concerns some Java API functions that do not appear to be exposed in Python, unless I have to access them in a different way.

If I obtain the term vector for the field "cat_ids" in a document with id doc_id_1

    doc_1_tfv = reader.getTermVector(doc_id_1, "cat_ids")

then doc_1_tfv is displayed as this object:

    <Terms: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTerms@32c46396 >

In some of the sample code I looked at, the terms in doc_1_tfv could be obtained with doc_1_tfv.getTerms(), but it looks like getTerms is no longer a member function of Terms or its subclasses. In another code sample, an iterator for the term vector is obtained via

    tfv_iter = doc_1_tfv.iterator(None)

and then the terms are obtained one by one with calls to tfv_iter.next(). This is where I get stuck.

tfv_iter has this value:

    <TermsEnum: org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum@1cca2369 >

There is a next() function defined for the TVTermsEnum class, but this object doesn't list next() as one of its member functions, and an exception is raised if it is called. It looks like the object only supports the member functions defined for the TermsEnum class, and next() is not one of them. Is this the case, or is there a way to have it support all of the TVTermsEnum member functions, including next()? TVTermsEnum is a private class in CompressingTermVectorsReader.java.

So I am wondering whether there is a way to obtain term vectors like this and I am just not treating doc_1_tfv and tfv_iter in the right way, whether there is a different, better way to get term vectors for documents in a PyLucene index, or whether this isn't something that Lucene should be used for.

Thank you very much for any help you can provide.

Mike
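
P.S. For concreteness, here is a minimal sketch of the similarity step I have in mind once the term vectors are available. It assumes each term vector has already been extracted into a plain Python dict mapping term to frequency. The function name cosine_similarity and the sample terms are made up for illustration and are not PyLucene functions.

```python
import math


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity of two sparse term-frequency vectors.

    Each argument is assumed to be a dict mapping term -> frequency
    (a hypothetical representation, not something PyLucene returns directly).
    """
    # Iterate over the smaller dict so the dot product touches fewer terms.
    if len(tf_a) > len(tf_b):
        tf_a, tf_b = tf_b, tf_a
    dot = sum(freq * tf_b.get(term, 0) for term, freq in tf_a.items())
    norm_a = math.sqrt(sum(f * f for f in tf_a.values()))
    norm_b = math.sqrt(sum(f * f for f in tf_b.values()))
    # An empty vector has no direction; treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up term vectors for two documents.
doc_1 = {"cat_12": 2, "cat_40": 1}
doc_2 = {"cat_12": 1, "cat_77": 3}
similarity = cosine_similarity(doc_1, doc_2)
```

With the term vectors in this form, the clustering step would call cosine_similarity on pairs of documents.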
