Re: Re : lucene document id's

Kay Roepke Sat, 27 Jan 2007 16:38:11 -0800

Hi!

I promised karl that I'd share something on this topic, so here it goes. It fits the subject, too ;)


On Jan 27, 2007, at 6:14 PM, Erick Erickson wrote:

I believe you are correct about when document IDs change. That said, I'd strongly recommend you spend some time trying to think of a way to keep from
doing this, since it may lead to endless synchronization issues.

Not only that, but it will be a performance hog, too. There's a chance that all doc ids change after only one delete and a subsequent merge! It could invalidate your entire mapping table in one go, forcing you to rebuild it from scratch.

And if you have IndexReaders that happen to have different versions of the index open at the same time, it's not even doable in one lookup table.

But if you must, you can retrieve a document with IndexReader.document(id);
On 1/27/07, saikrishna venkata pendyala <[EMAIL PROTECTED]> wrote:
Hi,
      I was trying to store the document ids externally.


I've recently had the need to do this, too.

Short story: Don't bother.

Slightly longer story: Don't bother. If you have to have a primary key to quickly get at your document (because you are running into performance troubles, e.g. with filter bitsets) turn the problem around. Don't try to store Lucene's doc ids externally but rather store the external primary key internally in Lucene. Not only as a field, but in a separate structure. Be aware that you have to change Lucene to add support for this.

I have found that lucene generates document ids linearly, starting
from 0, and that they are not changed until a document is deleted.
       but it did work for me.

As long as you don't delete anything, this should hold true, yes. The "rearranging" happens implicitly when segments are merged. Note that the document id is actually a misnomer, and exposing it could be considered a misfeature (IMHO). It actually is the offset into the index file.
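
To make the renumbering concrete, here is a toy model in plain Java. It is only a sketch and not Lucene code: the class DocIdShiftDemo, the method mergeDroppingDeleted, and the "pk-..." keys are made up for illustration. It assumes the simplest possible picture, where a document's id is just its position in a list and a merge packs the surviving documents together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only, not Lucene code: a "segment" is a list of primary keys
// and a document's id is simply its position in that list.
class DocIdShiftDemo {

    // Simulates a merge: documents flagged as deleted are dropped and the
    // survivors are packed together, so every later document moves up.
    static List<String> mergeDroppingDeleted(List<String> docs, boolean[] deleted) {
        List<String> merged = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            if (!deleted[docId]) {
                merged.add(docs.get(docId));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("pk-a", "pk-b", "pk-c", "pk-d");
        boolean[] deleted = new boolean[docs.size()];
        deleted[0] = true; // delete only the very first document

        List<String> merged = mergeDroppingDeleted(docs, deleted);

        // Print each key's position before and after; -1 means "gone".
        for (String pk : docs) {
            System.out.println(pk + ": before=" + docs.indexOf(pk)
                    + " after=" + merged.indexOf(pk));
        }
    }
}
```

In this model, deleting the first document changes the position of every remaining document once the merge has run, which is why an externally stored table of ids can become stale in one go.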

Was the above correct? If not, how could I store document ids
externally?

As I said, turn the problem around. Store the mapping information alongside the index and write code that is able to map between your primary key and Lucene document ids. You will have to augment the SegmentMerger and a couple of other places to get to the information, of course, but it's not that hard to do. If you have a huge number of documents, you probably won't be able to hold the entire mapping table in RAM, and then it gets harder to do it efficiently. But if you have enough RAM, use java.nio buffers! HashMaps and the like won't scale a bit.
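
For the in-RAM case, a java.nio-backed table might look roughly like the sketch below. This is not the code referred to in this mail and not part of Lucene: the class PrimaryKeyTable and its methods are hypothetical, it assumes primary keys fit into a long, and it leaves out the hard part (keeping the table in sync when segments are merged). It only shows the flat doc id -> primary key layout and how a filter bitset could be derived from it.

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: a flat docId -> primary key
// table held in one direct buffer instead of a HashMap of boxed objects.
class PrimaryKeyTable {
    private final LongBuffer keys;

    PrimaryKeyTable(int maxDoc) {
        // 8 bytes per document, allocated outside the Java heap.
        // multiplyExact fails loudly instead of silently overflowing.
        int bytes = Math.multiplyExact(maxDoc, Long.BYTES);
        this.keys = ByteBuffer.allocateDirect(bytes).asLongBuffer();
    }

    // Absolute put/get: the doc id is used directly as the buffer index.
    void put(int docId, long primaryKey) {
        keys.put(docId, primaryKey);
    }

    long primaryKeyOf(int docId) {
        return keys.get(docId);
    }

    // Builds a filter-style bitset with one bit per doc id whose primary
    // key is in the wanted set. Slots never written read as 0, so this
    // sketch assumes 0 is not a valid primary key.
    BitSet docsWithKeys(Set<Long> wanted) {
        BitSet bits = new BitSet(keys.capacity());
        for (int docId = 0; docId < keys.capacity(); docId++) {
            if (wanted.contains(keys.get(docId))) {
                bits.set(docId);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        PrimaryKeyTable table = new PrimaryKeyTable(4);
        for (int docId = 0; docId < 4; docId++) {
            table.put(docId, 1001L + docId);
        }
        Set<Long> wanted = new HashSet<Long>(Arrays.asList(1002L, 1004L));
        System.out.println(table.docsWithKeys(wanted));
    }
}
```

The linear scan in docsWithKeys is the simplest thing that works; with a large index and many filters you would want something smarter, but the memory layout stays the same.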

In the end it all comes down to the question: Why do you need this? If you need to create huge filter sets, you probably have a valid reason; if you simply want to retrieve a certain document, you probably don't. In the latter case, store your external id (primary key) as a field of each document and use a TermEnum to do the lookup. That's plenty fast for most applications, as I have found out, but it won't scale to big filter sets, esp. when you have a large number of updates per second on the index, thus invalidating any filter caches you might have.

If you have a rather static index and do not require subsecond response times from Lucene, all of what I outlined earlier is probably overkill.

But it is a good opportunity to learn things about Lucene you never wanted to know ;)

Hope that helps a bit! I'm sorry that I can't share any code, but I'm not allowed to do so :(


cheers,
-k
--
Kay Röpke
http://classdump.org/





