Re: Iterating over all documents in an index

Erick Erickson Sat, 12 Feb 2011 09:30:58 -0800

Be aware that doc.get() returns the *stored* field values in their
original, unanalyzed form. Is that really what you want? Or do you
want the tokenized form of the fields?
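
To make the distinction concrete, here is a minimal sketch in plain Java that makes no Lucene calls. The analyze() method is a hypothetical stand-in for an analyzer, assumed only to lowercase and split on non-letters; a real analyzer chain may also stem, drop stop words, and so on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class StoredVsAnalyzed {
    // Hypothetical stand-in for an analyzer (an assumption, not a Lucene class):
    // lowercases the text and splits it on runs of non-letter characters.
    static List<String> analyze(String storedValue) {
        List<String> tokens = new ArrayList<>();
        for (String t : storedValue.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // What a stored field hands back: the original text, untouched.
        String stored = "Running Dogs, running!";
        System.out.println(stored);          // Running Dogs, running!
        // What the index would hold under the toy analyzer above.
        System.out.println(analyze(stored)); // [running, dogs, running]
    }
}
```

If the stored form is what you have, a bag-of-words means running some analysis like this yourself over each stored value.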


If the latter, you might look at the Luke code; it reconstructs all the
fields in a document from the terms that are actually indexed. Note two
things:
1> It's slow. You're really undoing all the work that went into
   inverting the index in the first place.
2> It's lossy. For instance, a term that's been stemmed will only have
   the stemmed version in the index. Is that OK?
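
Both points can be seen in a toy model. The sketch below is plain Java, not Luke's code and not the Lucene API: invert() builds a term -> positions map for one document, and reconstruct() walks every term to put the text back together. The stem() method is a deliberately crude, hypothetical stemmer, assumed only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class ReconstructSketch {
    // Hypothetical toy stemmer (an assumption): strips a trailing "ing" or "s".
    static String stem(String token) {
        if (token.length() > 4 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Invert one document: map each stemmed term to the positions where it occurs.
    static Map<String, List<Integer>> invert(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        int pos = 0;
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(stem(raw), k -> new ArrayList<>()).add(pos++);
        }
        return postings;
    }

    // Undo the inversion: visit every term and place it at its recorded positions.
    // In this toy model the cost grows with the number of distinct terms.
    static String reconstruct(Map<String, List<Integer>> postings) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            for (int p : e.getValue()) {
                byPosition.put(p, e.getKey());
            }
        }
        return String.join(" ", byPosition.values());
    }

    public static void main(String[] args) {
        // Only the stemmed, lowercased forms come back; the originals are gone.
        System.out.println(reconstruct(invert("Running dogs chase cats")));
        // prints: runn dog chase cat
    }
}
```

The word order survives here only because the toy index keeps positions; the original surface forms do not survive at all.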

Best
Erick

On Sat, Feb 12, 2011 at 9:07 AM, Georger Araujo
<[email protected]> wrote:
> Hi,
> I want to iterate over all documents in a given index. I've found the
> following piece of code [1]:
>
> IndexReader reader = // create IndexReader
> for (int i=0; i<reader.maxDoc(); i++) {
>    if (reader.isDeleted(i))
>        continue;
>
>    Document doc = reader.document(i);
>    String docId = doc.get("docId");
>
>    // do something with docId here...
> }
>
> I implemented it in my code and it worked fine. After that, I found out
> about MatchAllDocsQuery.
> I am not concerned with scoring nor sorting - all I want to do is iterate
> over all documents in the index and collect their terms. My ultimate goal is
> to build a bag-of-words of all documents and their terms so that I can run a
> clustering algorithm on it. I've also found out about Mahout's built-in
> vector creation utility [2], but I need to do this task from my own code.
>
> I ask, what is the recommended approach?
>
> [1]
> http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index
> [2]
> https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text
>
> Regards,
>
> Georger
>

