Keep in mind that you'll have to store the length as you index. If you tried to store the length with each document as a post-step, you'd have to delete and re-add the document to the index...
That said, it's really up to you. It's very quick to use TermEnum/TermDocs to enumerate all the lengths. Even though this works with Lucene doc IDs, that's OK since you're working on a snapshot of the index: the IDs won't change before you close your reader (and presumably re-read the data).

Or, you can simply create some sort of unique ID for each doc that's entirely independent of the Lucene ID and store *that* ID along with the length in your meta-data.

Go with whichever you think would suit your needs better. Which is best you'll only discover by testing in your situation. I suspect either will be "good enough".

Erick

On 2/27/07, Mike O'Leary <[EMAIL PROTECTED]> wrote:
So if I wanted to record the length of each individual document, would it be better to store that information with each document, perhaps as an unindexed field? Or are there ways to refer to the indexed documents that don't change through delete and optimize steps?

Thanks.
Mike O'Leary

_____

From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 27, 2007 9:22 AM
To: java-user@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Storing extra data in index

You can just add a document. I used this technique in an application, and it hinges upon realizing that not all documents in an index need to have the same fields.

So, say your regular documents have fields f1, f2, f3...fn. Create a special document with fields s1, s2, s3, s4 that contain your meta data. Whenever you add more "regular" documents to the index you can modify the special document as necessary.

The beauty of this is that as long as the special document contains no fields in common with your regular documents, you'll never have it returned by searches because the fields are disjoint. And searches to find it will be very fast because there's only one.

You can take this as far as you like. For instance, you could store a field (no need to even index it!) that contains, say, an XML version of all the meta-data you want to use in your special document. Perhaps you want to read this document in at startup and store it in a convenient form. Or.....

If you go this route, you may want to consider creating and storing the meta-document as a post-build step. I was surprised at how quickly I could traverse an index and build up the meta-data document after I'd finished with all of the "regular" processing.

One caution, however; I'd be very careful about storing Lucene document Ids in my meta-data document since they may change if you delete documents and then optimize your index. In fact, they WILL change.
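A rough illustration of that caution, as a toy model rather than Lucene code: the list position below merely stands in for an internal document ID, and `compact()` stands in for an optimize step. The class `StableIdSketch` and all of its method names are hypothetical. The point it tries to show is that meta-data keyed by an application-assigned ID survives renumbering, while a remembered position does not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not the Lucene API): list position plays the role of an internal doc ID. */
final class StableIdSketch {
    // Each slot holds the application-assigned key of a document; null marks a deleted slot.
    private final List<String> slots = new ArrayList<>();
    // Meta-data keyed by the application-assigned key, never by position.
    private final Map<String, Integer> lengthByKey = new HashMap<>();

    /** Adds a document and returns its current internal position. */
    int add(String key, int length) {
        slots.add(key);
        lengthByKey.put(key, length);
        return slots.size() - 1;
    }

    /** Marks a document as deleted; the positions of other documents are unchanged for now. */
    void delete(String key) {
        int position = slots.indexOf(key);
        if (position >= 0) {
            slots.set(position, null);
        }
        lengthByKey.remove(key);
    }

    /** Stand-in for an optimize step: drops deleted slots, which renumbers later documents. */
    void compact() {
        slots.removeIf(key -> key == null);
    }

    /** Current internal position of a document, or -1 if it is not present. */
    int internalId(String key) {
        return slots.indexOf(key);
    }

    /** Stored length for a document, or null if it is not present. */
    Integer length(String key) {
        return lengthByKey.get(key);
    }
}
```

In this model, deleting an early document and then calling `compact()` shifts the position of every later document, yet `length(key)` still returns the right value because the lookup never depended on the position.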
BTW, I thoroughly approve of keeping all the parts you can in the index, since that's fewer things to keep track of.

Hope this helps
Erick

On 2/27/07, Mike O'Leary <[EMAIL PROTECTED]> wrote:

Is there a standard programming idiom for adding extra data to an index that has been created? I am trying to write code to index and search a set of documents using the BM25 algorithm, so (as I understand it) I need to store the length of each document somewhere and the average document length for the collection somewhere (and, I guess, the number of documents that have been indexed at any point so I can keep a running average). It seems like it would make sense to store these values in the index somehow so they are available to the search code. Is there sample code somewhere that describes how to do something like this? Or is there a better way that I'm not thinking of?

Thanks.
Mike O'Leary
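For reference, the bookkeeping discussed in this thread (per-document lengths, the document count, and a running average length) might be sketched in plain Java as below. This is only a sketch under stated assumptions: it uses no Lucene classes, the class `Bm25Stats` and its methods are hypothetical names, documents are keyed by an application-assigned ID rather than a Lucene doc ID, and the IDF shown is one common variant among several. How these values get persisted (for example in a special meta-data document) is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: per-document lengths plus a running average for BM25. */
final class Bm25Stats {
    // Length in tokens of each document, keyed by an application-assigned ID.
    private final Map<String, Integer> lengths = new HashMap<>();
    // Sum of all stored lengths, maintained incrementally so the average is cheap.
    private long totalLength = 0;

    /** Records a document's length; re-adding a key replaces its previous length. */
    void add(String docKey, int length) {
        Integer previous = lengths.put(docKey, length);
        if (previous != null) {
            totalLength -= previous;
        }
        totalLength += length;
    }

    /** Forgets a document, keeping the running total consistent. */
    void remove(String docKey) {
        Integer previous = lengths.remove(docKey);
        if (previous != null) {
            totalLength -= previous;
        }
    }

    int docCount() {
        return lengths.size();
    }

    double averageLength() {
        return lengths.isEmpty() ? 0.0 : (double) totalLength / lengths.size();
    }

    /**
     * Length-normalised term-frequency part of BM25:
     * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len / avgLen)).
     * k1 and b are the usual free parameters (values near 1.2 and 0.75 are common).
     */
    double tfComponent(String docKey, int termFreq, double k1, double b) {
        Integer length = lengths.get(docKey);
        if (length == null) {
            throw new IllegalArgumentException("unknown document: " + docKey);
        }
        double norm = 1.0 - b + b * length / averageLength();
        return termFreq * (k1 + 1.0) / (termFreq + k1 * norm);
    }

    /** One common IDF variant; docFreq is the number of documents containing the term. */
    static double idf(int docCount, int docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }
}
```

A term's contribution to a document's score would then be `Bm25Stats.idf(stats.docCount(), docFreq) * stats.tfComponent(docKey, termFreq, k1, b)`, summed over the query terms. Keeping the total length as a running sum means the average never has to be recomputed from scratch as documents are added or removed.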