We should probably talk about "internal" Lucene document IDs and "external" or "rebased" Lucene document IDs. The internal document IDs are always "per-segment" and never, ever change for that closed segment. But... the application would not normally see these IDs. Usually the externally visible Lucene document IDs have been "rebased" to add the sum total count of documents (both existing and deleted) of all preceding segments to the document IDs of a given segment, producing a "global" (across the full index of all segments) Lucene document ID.

So, if you have those three segments, with deleted documents in the first two segments, and then merge those first two segments, the externally-visible Lucene document IDs for the third segment will suddenly all be different, shifted lower by the number of deleted documents that were just merged away, even though nothing changed in the third segment itself.
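
To make the arithmetic concrete, here is a small sketch in plain Python (not Lucene code) of the rebasing described above. The segment sizes, the deleted counts and the `doc_bases` helper are all made up for illustration, and the model assumes the merged segment simply takes the place of the first two and drops their deleted documents.

```python
from itertools import accumulate

def doc_bases(max_docs):
    """Return the starting global ID (base) of each segment.

    max_docs[i] is the number of document slots in segment i,
    counting both live and deleted documents.
    """
    return [0] + list(accumulate(max_docs))[:-1]

# Hypothetical index: three segments given as (total slots, deleted docs).
segments = [(100, 10), (50, 5), (30, 0)]

before = doc_bases([slots for slots, _ in segments])

# Merge the first two segments: deleted docs are dropped, so the merged
# segment holds only the live documents of both.
merged_slots = sum(slots - deleted for slots, deleted in segments[:2])
after = doc_bases([merged_slots, segments[2][0]])

local_id = 7  # a document in the third segment; its local ID never changes
global_before = before[2] + local_id
global_after = after[1] + local_id
shift = global_before - global_after  # equals the number of deleted docs merged away

print(global_before, global_after, shift)
```

In this toy model the third segment's base moves down by the 15 deleted documents that were merged away, so every global ID in it moves with it, while the local ID stays the same.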

Maybe these should be called "local" (to the segment) Lucene document IDs and "global" (across all segments) Lucene document IDs. Or, maybe internal vs. external is good enough.

In short, it is completely safe to use and save Lucene document IDs, but only as long as no merging of segments is performed. Even one tiny merge invalidates the saved document IDs of the merged segments and of every segment after them. Be careful with your merge policy: normally merges happen in the background, automatically.

-- Jack Krupansky

-----Original Message-----
From: Erick Erickson
Sent: Sunday, November 24, 2013 8:31 AM
To: solr-user@lucene.apache.org
Subject: Re: building custom cache - using lucene docids

bq: Do I understand you correctly that when two segments get merged, the
docids (of the original segments) remain the same?

The original segments are unchanged, segments are _never_ changed after
they're closed. But they'll be thrown away. Say you have segment1 and
segment2 that get merged into segment3. As soon as the last searcher
that is looking at segment1 and segment2 is closed, those two segments
will be deleted from your disk.

But for any given doc, the docid in segment3 will very likely be different
than it was in segment1 or 2.
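
A toy model of that renumbering, in plain Python rather than Lucene code. It assumes the merge copies the live documents of segment1 and then segment2 in order, which is a simplification; the keys and deleted flags are invented, and `merge_segments` is a hypothetical helper.

```python
def merge_segments(segments):
    """Merge segments, dropping deleted docs and renumbering the rest.

    Each segment is a list of (key, is_deleted) pairs; the list index is
    the per-segment docid. Returns {key: new docid} for the merged segment.
    """
    new_ids = {}
    next_id = 0
    for segment in segments:
        for key, is_deleted in segment:
            if is_deleted:
                continue  # deleted docs are not copied into the new segment
            new_ids[key] = next_id
            next_id += 1
    return new_ids

segment1 = [("a", False), ("b", True), ("c", False)]
segment2 = [("d", True), ("e", False)]

segment3 = merge_segments([segment1, segment2])
print(segment3)
```

Here "c" had docid 2 in segment1 and "e" had docid 1 in segment2; in the merged segment they end up at 1 and 2.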

I think you're reading too much into LUCENE-2897. I'm pretty sure the
segment in question is not available to you anyway before this rewrite is
done, but freely admit I don't know much about it.

You're probably going to get into the whole PerSegment family of operations,
which is something I'm not all that familiar with so I'll leave
explanations to others.


On Sat, Nov 23, 2013 at 8:22 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

Hi Erick,
Many thanks for the info. An additional question:

Do I understand you correctly that when two segments get merged, the docids
(of the original segments) remain the same?

(unless, perhaps, they were merged with the last index segment, which was
open for writing and where the docids could have suddenly changed in a
commit just before the merge)

Yes, you guessed right that I am putting my code into the custom cache, so
it gets notified on index changes. I don't know yet how, but I think I can
find my way to the currently active, open (last) index segment, the one that
is actively updated (as opposed to just being merged). So my definition of
'not last ones' is: where docids don't change. I'd be grateful if someone
could spot any problem with this assumption.

roman




On Sat, Nov 23, 2013 at 7:39 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: But can I assume
> that docids in other segments (other than the last one) will be
> relatively stable?
>
> Kinda. Maybe. Maybe not. It depends on how you define "other than the
> last one".
>
> The key is that the internal doc IDs may change when segments are
> merged. And old segments get merged. Doc IDs will _never_ change
> in a segment once it's closed (although as you note they may be
> marked as deleted). But that segment may be written to a new segment
> when merging and the internal ID for a given document in the new
> segment bears no relationship to internal ID in the old segment.
>
> BTW, I think you only really care when opening a new searcher. There is
> a UserCache (see solrconfig.xml) that gets notified when a new searcher
> is being opened to give it an opportunity to refresh itself, is that
> useful?
>
> As long as a searcher is open, it's guaranteed that nothing is changing.
> Hard commits with openSearcher=false don't open new searchers, which
> is why changes aren't visible until a softCommit or a hard commit with
> openSearcher=true despite the fact that the segments are closed.
>
> FWIW,
> Erick
>
> Best
> Erick
>
>
>
> On Sat, Nov 23, 2013 at 12:40 AM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
>
> > Hi,
> > docids are 'ephemeral', but i'd still like to build a search cache with
> > them (they allow for the fastest joins).
> >
> > i'm seeing docids keep changing with updates (especially, in the last
> > index segment) - as per
> > https://issues.apache.org/jira/browse/LUCENE-2897
> >
> > That would be fine, because i could build the cache from diff (of index
> > state) + reading the latest index segment in its entirety. But can I
> > assume that docids in other segments (other than the last one) will be
> > relatively stable? (ie. when an old doc is deleted, the docid is marked
> > as removed; update doc = delete old & create a new docid)?
> >
> > thanks
> >
> > roman
> >
>

