Iterating over all documents in an index

Georger Araujo Sat, 12 Feb 2011 06:08:07 -0800

Hi,
I want to iterate over all documents in a given index. I've found the
following piece of code [1]:


IndexReader reader = // create IndexReader
for (int i = 0; i < reader.maxDoc(); i++) {
    // Skip document numbers that belong to deleted documents.
    if (reader.isDeleted(i))
        continue;

    // Load the stored fields of document i.
    Document doc = reader.document(i);
    String docId = doc.get("docId");

    // do something with docId here...
}

I implemented it in my code and it worked fine. After that, I found out
about MatchAllDocsQuery.
I am not concerned with scoring or sorting; all I want to do is iterate
over all documents in the index and collect their terms. My ultimate goal is
to build a bag-of-words of all documents and their terms so that I can run a
clustering algorithm on it. I've also found out about Mahout's built-in
vector creation utility [2], but I need to do this task from my own code.
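
To make the goal concrete, here is a rough sketch of the bag-of-words
structure I have in mind. It assumes each document's terms are already
available as a list of strings; how they are read from the index is left
open. The names BagOfWords, termCounts and build are hypothetical and are
not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: turns per-document term lists
// into per-document term-count maps (a simple bag-of-words).
public class BagOfWords {

    // Counts how often each term occurs in one document's term list.
    static Map<String, Integer> termCounts(List<String> terms) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String term : terms) {
            Integer previous = counts.get(term);
            counts.put(term, previous == null ? 1 : previous + 1);
        }
        return counts;
    }

    // Builds one term-count map per document, preserving document order.
    static List<Map<String, Integer>> build(List<List<String>> docs) {
        List<Map<String, Integer>> bags = new ArrayList<Map<String, Integer>>();
        for (List<String> doc : docs) {
            bags.add(termCounts(doc));
        }
        return bags;
    }

    public static void main(String[] args) {
        // Assumed input: terms already extracted for two documents.
        List<List<String>> docs = new ArrayList<List<String>>();
        docs.add(Arrays.asList("lucene", "index", "lucene"));
        docs.add(Arrays.asList("index", "reader"));
        System.out.println(build(docs));
    }
}
```

Each map in the resulting list would then become one input vector for the
clustering step.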

What is the recommended approach?

[1]
http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index
[2]
https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text

Regards,

Georger
