Hi, I want to iterate over all documents in a given index. I've found the following piece of code [1]:

    IndexReader reader = // create IndexReader
    for (int i=0; i<reader.maxDoc(); i++) {
        if (reader.isDeleted(i))
            continue;

        Document doc = reader.document(i);
        String docId = doc.get("docId");

        // do something with docId here...
    }

I implemented it in my code and it worked fine. After that, I found out about MatchAllDocsQuery. I am not concerned with scoring or sorting; all I want to do is iterate over all documents in the index and collect their terms. My ultimate goal is to build a bag-of-words of all documents and their terms so that I can run a clustering algorithm on it.

I've also found out about Mahout's built-in vector creation utility [2], but I need to do this task from my own code. Which approach is recommended for this: the loop above or MatchAllDocsQuery?

[1] http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index
[2] https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text

Regards,
Georger
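
P.S. To make the goal concrete, here is a rough sketch of the bag-of-words structure I have in mind. It uses only the Java standard library and deliberately leaves out the Lucene part: the class name BagOfWords and its methods are hypothetical, and it assumes I can somehow obtain the list of terms for each document while iterating over the index.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene or Mahout.
final class BagOfWords {
    // docId -> (term -> number of occurrences in that document)
    private final Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();

    // Record the terms of one document. Assumption: the caller has already
    // extracted the terms from the index while iterating over the documents.
    void add(String docId, Iterable<String> terms) {
        Map<String, Integer> docCounts =
                counts.computeIfAbsent(docId, k -> new TreeMap<>());
        for (String term : terms) {
            docCounts.merge(term, 1, Integer::sum);
        }
    }

    // Number of times the term occurs in the document, or 0 if either is unknown.
    int count(String docId, String term) {
        Map<String, Integer> docCounts = counts.get(docId);
        if (docCounts == null) {
            return 0;
        }
        return docCounts.getOrDefault(term, 0);
    }

    // Read-only view of the whole structure, e.g. as input for clustering.
    Map<String, Map<String, Integer>> asMap() {
        return Collections.unmodifiableMap(counts);
    }
}
```

Inside the loop I would call something like bag.add(docId, termsOfThisDocument) for each document that is not deleted; how best to obtain termsOfThisDocument is exactly the part I am unsure about.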