Be aware that when you do a doc.get(), the fields are the *stored* fields in their original, unanalyzed form. Is that really what you want? Or do you want the tokenized form of the fields?
If the latter, you might look at the Luke code; it reconstructs all the fields in a document from the terms that are actually indexed. Note two things:
1> It's slow. You're really undoing all the work that went into inverting the index in the first place.
2> It's lossy. For instance, a term that's been stemmed will only have the stemmed version in the index.

Is that OK?

Best
Erick

On Sat, Feb 12, 2011 at 9:07 AM, Georger Araujo <[email protected]> wrote:
> Hi,
> I want to iterate over all documents in a given index. I've found the
> following piece of code [1]:
>
> IndexReader reader = // create IndexReader
> for (int i=0; i<reader.maxDoc(); i++) {
>     if (reader.isDeleted(i))
>         continue;
>
>     Document doc = reader.document(i);
>     String docId = doc.get("docId");
>
>     // do something with docId here...
> }
>
> I implemented it in my code and it worked fine. After that, I found out
> about MatchAllDocsQuery.
> I am not concerned with scoring nor sorting - all I want to do is iterate
> over all documents in the index and collect their terms. My ultimate goal is
> to build a bag-of-words of all documents and their terms so that I can run a
> clustering algorithm on it. I've also found out about Mahout's built-in
> vector creation utility [2], but I need to do this task from my own code.
>
> I ask, what is the recommended approach?
>
> [1]
> http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index
> [2]
> https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text
>
> Regards,
>
> Georger
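
P.S. To make the two caveats in the reply (slow and lossy) concrete, here is a toy sketch in plain Java. It does not use Lucene or Luke at all: the class ToyInvertedIndex, its crude stem() rule and the sample text are made up for illustration, and a real analyzer and index behave differently in detail. It only shows the general shape of the problem: rebuilding a document means walking the postings of every term, and what comes back is the analyzed text, not the original.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: this is NOT Lucene's API or Luke's implementation.
final class ToyInvertedIndex {
    // term -> postings, each posting being {docId, position}
    private final Map<String, List<int[]>> postings = new TreeMap<>();
    private int docCount = 0;

    // Hypothetical "stemmer": lowercases, then strips a trailing "ing" or "s".
    static String stem(String token) {
        String t = token.toLowerCase();
        if (t.length() > 5 && t.endsWith("ing")) return t.substring(0, t.length() - 3);
        if (t.length() > 3 && t.endsWith("s")) return t.substring(0, t.length() - 1);
        return t;
    }

    // Inverts one document: only the stemmed terms and their positions are kept.
    int addDocument(String text) {
        int docId = docCount++;
        int position = 0;
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(stem(token), k -> new ArrayList<>())
                    .add(new int[] {docId, position++});
        }
        return docId;
    }

    // "Un-inverts" one document. Slow: it scans the postings of every term.
    // Lossy: case, punctuation and the unstemmed word forms are gone.
    String reconstruct(int docId) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, List<int[]>> entry : postings.entrySet()) {
            for (int[] posting : entry.getValue()) {
                if (posting[0] == docId) byPosition.put(posting[1], entry.getKey());
            }
        }
        return String.join(" ", byPosition.values());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        int id = index.addDocument("Clustering documents, running fast");
        System.out.println(index.reconstruct(id)); // prints "cluster document runn fast"
    }
}
```

For a bag-of-words, the lossy form may be exactly what you want, since clustering usually works on the analyzed terms anyway; the cost of the full scan per document is the part to watch.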
