Keep in mind that you'll have to store the length as you index. If you
tried to store the length with each document as a post-step, you'd have
to delete and re-add the document to the index...

That said, it's really up to you. It's very quick to use
TermEnum/TermDocs to enumerate all the lengths. Even though this
works with Lucene doc IDs, that's OK because you're working on a
snapshot of the index: the IDs won't change before you
close your reader (and presumably re-read the data).

Or, you can simply create some sort of unique ID for each
doc that's entirely independent of the Lucene ID and store
*that* id along with the length in your meta-data. Use whichever
you think would suit your needs better.
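
A minimal sketch of that second option, in plain Java with no Lucene
calls. The class name DocLengthTable, the IDs, and the tab-separated
text format are all made up for illustration; the point is only that
lookups are keyed on your own ID, so they are unaffected by Lucene
renumbering its documents.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps an application-assigned ID (NOT a Lucene doc ID)
// to that document's length.
class DocLengthTable {
    private final Map<String, Integer> lengths = new LinkedHashMap<>();

    // Record (or overwrite) the length for one document.
    void put(String appId, int length) {
        if (appId.indexOf('\t') >= 0 || appId.indexOf('\n') >= 0) {
            throw new IllegalArgumentException("ID must not contain tab or newline");
        }
        lengths.put(appId, length);
    }

    // Returns the recorded length, or -1 if the ID is unknown.
    int lengthOf(String appId) {
        Integer n = lengths.get(appId);
        return n == null ? -1 : n;
    }

    // Flatten to one "id<TAB>length" line per document, in insertion order.
    String serialize() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : lengths.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Rebuild a table from the text produced by serialize().
    static DocLengthTable parse(String text) {
        DocLengthTable table = new DocLengthTable();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) continue;
            int tab = line.indexOf('\t');
            table.put(line.substring(0, tab), Integer.parseInt(line.substring(tab + 1)));
        }
        return table;
    }
}
```

The serialized string is one candidate for what you keep in your
meta-data; each regular document would also need to carry the same ID
so you can get from a hit back to its length.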

You'll only discover which is best by testing in your situation.
I suspect either will be "good enough".

Erick

On 2/27/07, Mike O'Leary <[EMAIL PROTECTED]> wrote:

So if I wanted to record the length of each individual document, would it be
better to store that information with each document, perhaps as an unindexed
field? Or are there ways to refer to the indexed documents that don't change
through delete and optimize steps? Thanks.

Mike O'Leary

  _____

From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 27, 2007 9:22 AM
To: java-user@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Storing extra data in index



You can just add a document. I used this technique in an application,
and it hinges upon realizing that not all documents in an index need
to have the same fields. So, say your regular documents have
fields f1, f2, f3...fn. Create a special document with fields
s1, s2, s3, s4 that contain your meta data. Whenever you add
more "regular" documents to the index you can modify
the special document as necessary.

The beauty of this is that as long as the special document
contains no fields in common with your regular documents,
you'll never have it returned by searches because the fields
are disjoint. And searches to find it will be very fast because
there's only one.

You can take this as far as you like. For instance, you
could store a field (no need to even index it!) that
contains, say, an XML version of all the meta-data
you want to use in your special document. Perhaps
you want to read this document in at startup and
store it in a convenient form. Or.....
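
As a rough illustration of the XML idea, here is a sketch that uses
only java.util.Properties from the JDK to turn a set of key/value
pairs into an XML string and back. The class name MetaXml and any keys
you put in it are hypothetical, and Properties is just one convenient
choice of XML format. The resulting string is the sort of value you
could keep in a stored, unindexed field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper: round-trips meta-data through the JDK's
// built-in XML properties format.
class MetaXml {
    // Serialize all key/value pairs to a single XML string.
    static String toXml(Properties meta) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            meta.storeToXML(out, null, "UTF-8");
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parse a string produced by toXml() back into key/value pairs.
    static Properties fromXml(String xml) {
        try {
            Properties meta = new Properties();
            meta.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return meta;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

At startup you would read the stored string once, call fromXml(), and
keep the result in memory.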

If you go this route, you may want to consider creating and storing
the meta-document as a post-build step. I was surprised at how
quickly I could traverse an index and build up the meta-data
document after I'd finished with all of the "regular" processing.
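
For the BM25 case, the collection-level numbers (document count and
average document length) can be accumulated in a pass like that. The
sketch below is plain Java; the class name CollectionStats is made up,
and how you obtain each document's length from the index is left out
because it depends on how you stored it.

```java
// Hypothetical accumulator for the collection statistics BM25 needs.
class CollectionStats {
    private long docCount;
    private long totalLength;

    // Call once per document seen during the pass (or as each one is indexed).
    void add(int length) {
        docCount++;
        totalLength += length;
    }

    // Call when a document is deleted, so the running average stays correct.
    void remove(int length) {
        if (docCount == 0) throw new IllegalStateException("no documents recorded");
        docCount--;
        totalLength -= length;
    }

    long docCount() {
        return docCount;
    }

    // Average document length; 0.0 for an empty collection.
    double averageLength() {
        return docCount == 0 ? 0.0 : (double) totalLength / docCount;
    }
}
```

Keeping the total and the count, rather than the average itself, means
you can update both incrementally as documents come and go.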

One caution, however: I'd be very careful about storing Lucene
document IDs in my meta-data document, since they may change
if you delete documents and then optimize your index. In fact, they
WILL change.

BTW, I thoroughly approve of keeping all the parts you can
in the index, since that's fewer things to keep track of.

Hope this helps
Erick

On 2/27/07, Mike O'Leary <[EMAIL PROTECTED]> wrote:

Is there a standard programming idiom for adding extra data to an index that
has been created? I am trying to write code to index and search a set of
documents using the BM25 algorithm, so (as I understand it) I need to store
the length of each document somewhere and the average document length for
the collection somewhere (and, I guess, the number of documents that have
been indexed at any point so I can keep a running average). It seems like it
would make sense to store these values in the index somehow so they are
available to the search code. Is there sample code somewhere that describes
how to do something like this? Or is there a better way that I'm not
thinking of? Thanks.

Mike O'Leary



