Hi!

I promised karl that I'd share something on this topic, so here it goes. It fits the subject, too ;)

On Jan 27, 2007, at 6:14 PM, Erick Erickson wrote:

I believe you are correct about when document IDs change. That said, I'd strongly recommend you spend some time trying to think of a way to keep from
doing this, since it may lead to endless synchronization issues.

Not only that, but it will be a performance hog, too. All doc ids can change after a single delete and a subsequent merge! That would invalidate your entire mapping table in one go, forcing you to rebuild it from scratch.

And if you have IndexReaders that happen to have different versions of the index open at the same time, a single lookup table can't even represent the mapping.

But if you must, you can retrieve a document with IndexReader.document(id);

On 1/27/07, saikrishna venkata pendyala <[EMAIL PROTECTED]> wrote:

Hi,
      I was trying to store document ids externally.

I've recently had the need to do this, too.

Short story: Don't bother.

Slightly longer story: Don't bother. If you need a primary key to get at your documents quickly (because you are running into performance trouble, e.g. with filter bitsets), turn the problem around. Don't try to store Lucene's doc ids externally; store the external primary key internally in Lucene instead, not only as a field but also in a separate structure. Be aware that you have to modify Lucene itself to add support for this.

I have found that Lucene generates document ids linearly, starting
from 0, and that they are not changed until a document is deleted.
       but it did work for me.

As long as you don't delete anything, this should hold true, yes. The "rearranging" happens implicitly during segment merges. Note that "document id" is actually a misnomer, and exposing it could be considered a misfeature (IMHO). It actually is the offset into the index file.
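
To make the renumbering concrete, here is a toy model in plain Java. It is not Lucene code: the class DocIdShiftDemo, the method mergeAway and the "pk-..." keys are made up for illustration, and a plain list stands in for a segment, with the list position playing the role of the document id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a segment: a document's id is simply its position in the list. */
final class DocIdShiftDemo {

    /** Simulates a merge that drops one deleted document and renumbers the survivors. */
    static List<String> mergeAway(List<String> segment, int deletedDocId) {
        List<String> merged = new ArrayList<>(segment);
        merged.remove(deletedDocId); // remove(int index): every later entry moves down by one
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical external primary keys, stored in insertion order.
        List<String> before = Arrays.asList("pk-a", "pk-b", "pk-c", "pk-d");
        List<String> after = mergeAway(before, 0);
        for (int docId = 0; docId < after.size(); docId++) {
            String key = after.get(docId);
            System.out.println(key + ": docId " + before.indexOf(key) + " -> " + docId);
        }
    }
}
```

Deleting the very first document moves every surviving document down by one, so an external table keyed on the old ids is wrong for all of them at once.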

Was the above correct? If not, how could I store document ids
externally.

As I said, turn the problem around. Store the mapping information alongside the index and write code that is able to map between your primary key and Lucene document ids. You will have to augment the SegmentMerger and a couple of other places to get to the information, of course, but it's not that hard to do. If you have a huge number of documents, you probably won't be able to hold the entire mapping table in RAM, and then it gets harder to do efficiently. But if you have enough RAM, use java.nio buffers! HashMaps and the like won't scale a bit.
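
Since I can't share the real code, here is only a rough sketch of the general shape, under assumptions of my own: primary keys are longs, the table lives in one direct java.nio buffer, and the names PrimaryKeyTable, add, docId and compact are hypothetical. None of this is Lucene API; compact() just mimics what a hook in the merge code would have to do.

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;
import java.util.BitSet;

/** Hypothetical docId -> primary key table backed by an off-heap buffer. */
final class PrimaryKeyTable {
    private final LongBuffer keys; // slot i holds the primary key of docId i
    private int size;              // number of documents added so far

    PrimaryKeyTable(int maxDoc) {
        // 8 bytes per document, allocated outside the Java heap.
        keys = ByteBuffer.allocateDirect(Math.multiplyExact(maxDoc, Long.BYTES)).asLongBuffer();
    }

    /** Appends a document and returns its id (ids are handed out linearly from 0). */
    int add(long primaryKey) {
        keys.put(size, primaryKey);
        return size++;
    }

    long primaryKey(int docId) {
        if (docId < 0 || docId >= size) throw new IndexOutOfBoundsException("docId " + docId);
        return keys.get(docId);
    }

    /** Linear scan for brevity; a real table would keep a second, key-sorted buffer. */
    int docId(long primaryKey) {
        for (int i = 0; i < size; i++) {
            if (keys.get(i) == primaryKey) return i;
        }
        return -1;
    }

    /** Mimics a merge: drops deleted documents and renumbers the survivors. */
    PrimaryKeyTable compact(BitSet deleted) {
        PrimaryKeyTable merged = new PrimaryKeyTable(keys.capacity());
        for (int docId = 0; docId < size; docId++) {
            if (!deleted.get(docId)) merged.add(keys.get(docId));
        }
        return merged;
    }
}
```

The point of the buffer is that the table costs exactly 8 bytes per document with no per-entry objects, which is where HashMaps fall over. The reverse lookup is deliberately naive here; sorting by key and using a binary search would be the obvious next step.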

In the end it all comes down to the question: Why do you need this? If you need to create huge filter sets, you probably have a valid reason. If you simply want to retrieve a certain document, you probably don't. In the latter case, store your external id (primary key) as a field of each document and use a TermEnum to do the lookup. That's plenty fast for most applications, as I have found out, but it won't scale to big filter sets, especially when you have a large number of updates per second on the index, which invalidates any filter caches you might have. If you have a rather static index and do not require subsecond response times from Lucene, everything I outlined earlier is probably overkill.

But it is a good opportunity to learn things about Lucene you never wanted to know ;)

Hope that helps a bit! I'm sorry that I can't share any code, but I'm not allowed to do so :(

cheers,
-k
--
Kay Röpke
http://classdump.org/




