Hi,
I want to iterate over all documents in a given index. I've found the
following piece of code [1]:
    IndexReader reader = // create IndexReader
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i))
            continue;
        Document doc = reader.document(i);
        String docId = doc.get("docId");
        // do something with docId here...
    }
I implemented it in my code and it worked fine. After that, I found out
about MatchAllDocsQuery.
I am not concerned with scoring or sorting. All I want to do is iterate
over all documents in the index and collect their terms. My ultimate goal is
to build a bag-of-words of all documents and their terms so that I can run a
clustering algorithm on it. I've also found out about Mahout's built-in
vector creation utility [2], but I need to do this task from my own code.
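To make the goal concrete, here is a rough sketch of the bag-of-words
structure I have in mind. It is plain Java with no Lucene or Mahout calls;
the class name BagOfWords and its methods are hypothetical, and it assumes
the terms of each document are already available as a list of strings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene or Mahout.
class BagOfWords {
    // docId -> (term -> number of occurrences in that document)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Adds the terms of one document. Calling this again with the same
    // docId adds to the counts already recorded for that document.
    public void addDocument(String docId, List<String> terms) {
        Map<String, Integer> termCounts =
                counts.computeIfAbsent(docId, k -> new HashMap<>());
        for (String term : terms) {
            termCounts.merge(term, 1, Integer::sum);
        }
    }

    // Returns how often the term occurs in the document, or 0 if either
    // the document or the term has not been seen.
    public int frequency(String docId, String term) {
        Map<String, Integer> termCounts = counts.get(docId);
        if (termCounts == null) {
            return 0;
        }
        return termCounts.getOrDefault(term, 0);
    }

    // Number of distinct documents recorded so far.
    public int size() {
        return counts.size();
    }
}
```

Inside the loop from [1], each docId and its terms would be passed to
addDocument. How best to obtain those terms from the index is part of what
I am asking.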
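Which approach is recommended for this: iterating over document numbers
with IndexReader as in the snippet above, or using MatchAllDocsQuery?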
[1] http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index
[2] https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text
Regards,
Georger