Hi Erick, Thanks for the input. I didn't really do the doc.get() in my code, because what I'm really looking for are the stemmed terms in the index. I only borrowed the idea from the snippet. Here's how my code looks like now:
Map<String,Integer> termMap = new TreeMap<String,Integer>(); List<String> filenames = new ArrayList<String>( reader.maxDoc() ); for ( int i = 0; i < reader.maxDoc(); i++ ) { if ( reader.isDeleted( i ) ) continue; Document doc = reader.document( i ); filenames.add( i, doc.get( "filename" ) ); TermFreqVector tfVector = reader.getTermFreqVector( i, "filecontents" ); for ( int j = 0; j < tfVector.getTerms().length; j++ ) { // Add terms to the map... } } The point is that I need to read the term frequency vector for each document. I actually want the stemmed terms. From that and the filenames (stored in the "filename" field), i was able to build a matrix that looks like this: term1 term2 ... termn doc1 1 1 0 doc2 0 1 0 ... docn 1 0 1 I already have this working. I just don't know if I'm doing it the simplest/quickest way, though - and that's going to be an issue because I intend to run the clustering algorithm on matrices as big as 5000 x 100000. Regards, Georger 2011/2/12 Erick Erickson <erickerick...@gmail.com> > Be aware that when you do a doc.get(), the fields are the > *stored* fields in their original, unanalyzed form. Is that really > what you want? Or do you want the tokenized form of the fields? > > If the latter, you might get the Luke code, it reconstructs all the fields > in the document from the terms that are actually indexed. Note two > things: 1> it's slow. You're really undoing all the work that went into > inverting the index in the first place. > 2> it's lossy. For instance, a term that's been stemmed will only have > the stemmed version in the index. Is that OK? > > Best > Erick > > On Sat, Feb 12, 2011 at 9:07 AM, Georger Araujo > <georger.ara...@gmail.com> wrote: > > Hi, > > I want to iterate over all documents in a given index. I've found the > > following piece of code [1]: > > > > IndexReader reader = // create IndexReader > > for (int i=0; i<reader.maxDoc(); i++) { > > if (reader.isDeleted(i)) > > continue; > > > > Document doc = reader.document(i); > > String docId = doc.get("docId"); > > > > // do something with docId here... > > } > > > > I implemented it in my code and it worked fine. After that, I found out > > about MatchAllDocsQuery. > > I am not concerned with scoring nor sorting - all I want to do is iterate > > over all documents in the index and collect their terms. My ultimate goal > is > > to build a bag-of-words of all documents and their terms so that I can > run a > > clustering algorithm on it.I've also found out about Mahout's built-in > > vector creation utility [2], but I need to do this task from my own code. > > > > I ask, what is the recommended approach? > > > > [1] > > > http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index > > [2] > > > https://cwiki.apache.org/confluence/display/MAHOUT/Creating%20Vectors%20from%20Text > > > > Regards, > > > > Georger > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >