Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
On Sun, Nov 24, 2013 at 8:31 AM, Erick Erickson erickerick...@gmail.com wrote:

 bq: Do I understand you correctly that when two segments get merged, the
 docids (of the original segments) remain the same?

 The original segments are unchanged, segments are _never_ changed after
 they're closed. But they'll be thrown away. Say you have segment1 and
 segment2 that get merged into segment3. As soon as the last searcher
 that is looking at segment1 and segment2 is closed, those two segments
 will be deleted from your disk.

 But for any given doc, the docid in segment3 will very likely be different
 than it was in segment1 or 2.


I'm trying to figure this out; I'll have to dig, I suppose. For example,
if the docbase (the docid offset per segment) were stored together with the
index segment, that would be an indication of 'relative stability of docids'.



 I think you're reading too much into LUCENE-2897. I'm pretty sure the
 segment in question is not available to you anyway before this rewrite is
 done, but freely admit I don't know much about it.


I've done tests (Solr 4.0), committing and overwriting a document, and saw
that docids are being recycled. I deleted 2 docs, then added a new document,
and guess what: the new document had the docid of a previously deleted
document (but different fields).

That was new to me, so I searched and found LUCENE-2897, which seemed to
explain that behaviour.



 You're probably going to get into the whole PerSegment family of
 operations, which is something I'm not all that familiar with, so I'll
 leave explanations to others.


Thank you, it is useful to get insights from various sides,

  roman






Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
On Sun, Nov 24, 2013 at 10:44 AM, Jack Krupansky j...@basetechnology.com wrote:

 We should probably talk about internal Lucene document IDs and
 external or rebased Lucene document IDs. The internal document IDs are
 always per-segment and never, ever change for that closed segment. But...
 the application would not normally see these IDs. Usually the externally
 visible Lucene document IDs have been rebased to add the sum total count
 of documents (both existing and deleted) of all preceding segments to the
 document IDs of a given segment, producing a global (across the full
 index of all segments) Lucene document ID.

 So, if you have those three segments, with deleted documents in the first
 two segments, and then merge those first two segments, the
 externally-visible Lucene document IDs for the third segment will suddenly
 all be different, shifted lower by the number of deleted documents that
 were just merged away, even though nothing changed in the third segment
 itself.


That's right, and I'm starting to think that if I keep the segment id and
the original offset, I don't need to rebuild that part of the cache,
because it has not been rebased (but I can always update the deleted docs).
It seems simple, so I suspect there is a catch somewhere. But if it
works, it could potentially speed up any cache building.

Do you know where the docbase of a segment is stored? Or which Java class
I should start my exploration from? [The codebase is somewhat sprawling and
complex, so I'm a bit lost :)]



 Maybe these should be called local (to the segment) Lucene document IDs
 and global (across all segment) Lucene document IDs. Or, maybe internal
 vs. external is good enough.

 In short, it is completely safe to use and save Lucene document IDs, but
 only as long as no merging of segments is performed. Even one tiny merge
 and all subsequent saved document IDs are invalidated. Be careful with your
 merge policy - normally merges are happening in the background,
 automatically.


My tests, as per my previous email, showed that the docids in the last
segment are not that stable. I don't know if it matters that I used a
RAMDirectory for the test, but the docids were being 'recycled': the deleted
docs were in the previous segment, then suddenly their docids were assigned
to newly added documents (so maybe Solr/Lucene is not counting deleted docs
if they are at the end of a segment?). I don't know. I'll need to explore
the index segments to understand what was going on there. Thanks for any
pointers.


  roman






Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
On Mon, Nov 25, 2013 at 12:54 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Roman,

 I don't fully understand your question. After a segment is flushed it is
 never changed, hence segment-local docids always stay the same. Due to a
 merge a segment can be gone, and its docs become new ones in another
 segment. 'Global' (Solr-style) docnums are affected too: they can shift
 after a merge happens in the middle of the segment chain.
 Since you are talking about a segmented cache, I suggest looking at
 CachingWrapperFilter and NoOpRegenerator as patterns for such data
 structures.


Thanks Mikhail, CachingWrapperFilter (CWF) confirms that the idea of
regenerating just part of the cache is doable. The CacheRegenerators, on the
other hand, make no sense to me: they are not given any 'signals', so they
don't know whether they are in the middle of a regeneration or not, and they
should not keep state (of the previous index), as they can be shared by
threads that build the cache.

Best,

  roman







Re: building custom cache - using lucene docids

2013-11-25 Thread Roman Chyla
OK, I've spent some time reading the Solr/Lucene 4.x classes, and this is
my understanding (feel free to correct me ;-))

DirectoryReader holds the opened segments; each segment has its own
reader. BaseCompositeReader (or classes extending it) stores the start
offset of each segment, e.g. [0, 5, 22] means there are 2 segments
with 5 and 17 docs respectively.
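
To make the offsets concrete, here is a minimal sketch of how a global docid could be mapped back to a segment and a segment-local docid using such an array. It is plain Java with no Lucene dependency; the class and method names are made up for illustration and it assumes there are no empty segments.

```java
import java.util.Arrays;

// Illustration only: plain Java, not Lucene code. All names here are hypothetical.
final class SegmentStarts {

    // Builds the offsets array from per-segment doc counts, e.g. {5, 17} -> {0, 5, 22}.
    static int[] fromDocCounts(int[] docCounts) {
        int[] starts = new int[docCounts.length + 1];
        for (int i = 0; i < docCounts.length; i++) {
            starts[i + 1] = starts[i] + docCounts[i];
        }
        return starts;
    }

    // Index of the segment holding a global docid (assumes no empty segments).
    static int segmentOf(int globalDocId, int[] starts) {
        int segmentCount = starts.length - 1;
        if (globalDocId < 0 || globalDocId >= starts[segmentCount]) {
            throw new IllegalArgumentException("docid out of range: " + globalDocId);
        }
        // Search only the segment start offsets, not the trailing total.
        int pos = Arrays.binarySearch(starts, 0, segmentCount, globalDocId);
        // Exact hit: first doc of that segment. Miss: the segment before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Segment-local docid = global docid minus the segment's offset (docbase).
    static int localDocId(int globalDocId, int[] starts) {
        return globalDocId - starts[segmentOf(globalDocId, starts)];
    }

    public static void main(String[] args) {
        int[] starts = fromDocCounts(new int[] {5, 17});
        System.out.println(Arrays.toString(starts));                          // [0, 5, 22]
        System.out.println(segmentOf(7, starts) + " / " + localDocId(7, starts)); // 1 / 2
    }
}
```

Global docid 7 falls into the second segment (index 1) as local docid 2. A cache that stores values by (segment, local docid) would not care if the offsets shift later.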

The segments are listed in the segments_N file:
http://lucene.apache.org/core/3_0_3/fileformats.html#Segments%20File

So theoretically, the order of segments could change when a merge happens.
Yet every SegmentReader is identified by a unique name, and this name doesn't
change unless the segment itself changed (i.e. docs were deleted, or it got
more docs), so it is possible to rely on this name to know what has not
changed.

The name comes from SegmentInfo (check its toString method). SegmentInfo
has an equals() method that considers readers with the same name and the
same dir as equal (which is useful to know: two readers, one with deletes,
one without, are equal).
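
A rough sketch of what relying on the segment name could look like: a cache keyed by segment name that, on each refresh, keeps entries for segments that are still present, loads only the new ones, and drops the rest. This is plain Java, not Lucene or Solr code; PerSegmentCache and SegmentLoader are hypothetical names, and deletions inside a kept segment would still have to be tracked separately.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustration only: plain Java, not Lucene code. All names here are hypothetical.
final class PerSegmentCache {

    // Stands in for "read the values of one segment from the index".
    interface SegmentLoader {
        int[] load(String segmentName);
    }

    // Segment name -> values indexed by segment-local docid.
    private final Map<String, int[]> bySegment = new HashMap<>();
    private int loadCount = 0;

    // Reuse unchanged segments, load new ones, drop the ones that are gone.
    void refresh(List<String> currentSegments, SegmentLoader loader) {
        bySegment.keySet().retainAll(new HashSet<>(currentSegments));
        for (String name : currentSegments) {
            if (!bySegment.containsKey(name)) {
                bySegment.put(name, loader.load(name));
                loadCount++;
            }
        }
    }

    int[] get(String segmentName) {
        return bySegment.get(segmentName);
    }

    int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        PerSegmentCache cache = new PerSegmentCache();
        SegmentLoader loader = name -> new int[] {name.length()}; // dummy data
        cache.refresh(Arrays.asList("_0", "_1"), loader);
        cache.refresh(Arrays.asList("_0", "_2"), loader); // "_1" is gone, "_2" is new
        System.out.println(cache.loadCount()); // 3: "_0" was loaded only once
    }
}
```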

Lucene's FieldCache itself is rather complex, but it shows there is a very
clever mechanism (a few, actually!): a class can register a listener that
will be called whenever an index segment is being closed (this could be
used to invalidate portions of a cache). The relevant classes are
SegmentReader.CoreClosedListener and IndexReader.ReaderClosedListener.
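
The listener idea can be sketched like this. Note that the interface and classes below only imitate the pattern in plain Java; they are hypothetical stand-ins, not Lucene's CoreClosedListener or ReaderClosedListener, whose exact signatures should be checked in the Lucene source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: imitates the listener pattern; these are NOT Lucene's classes.
final class ClosedListenerSketch {

    interface ClosedListener {
        void onClose(Object segmentKey);
    }

    // Stand-in for a segment reader that notifies listeners when it is closed.
    static final class FakeSegmentReader {
        private final String name;
        private final List<ClosedListener> listeners = new ArrayList<>();

        FakeSegmentReader(String name) {
            this.name = name;
        }

        void addClosedListener(ClosedListener listener) {
            listeners.add(listener);
        }

        void close() {
            for (ClosedListener listener : listeners) {
                listener.onClose(name);
            }
        }
    }

    // Cache that drops only the entry of the segment that was closed.
    static final class Cache implements ClosedListener {
        final Map<Object, int[]> entries = new HashMap<>();

        void put(FakeSegmentReader reader, int[] values) {
            entries.put(reader.name, values);
            reader.addClosedListener(this);
        }

        @Override
        public void onClose(Object segmentKey) {
            entries.remove(segmentKey);
        }
    }

    public static void main(String[] args) {
        Cache cache = new Cache();
        FakeSegmentReader a = new FakeSegmentReader("_0");
        FakeSegmentReader b = new FakeSegmentReader("_1");
        cache.put(a, new int[] {1, 2});
        cache.put(b, new int[] {3});
        a.close(); // only the "_0" entry is purged
        System.out.println(cache.entries.keySet()); // [_1]
    }
}
```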

But Lucene uses this mechanism only to purge the cache, so
effectively every commit triggers a cache rebuild. This is the interesting
bit: lots of work could be spared if segment data were reused (but
admittedly, only sometimes: for data that was fully read into memory. For
anything else, such as terms, the cache reads only some values and
fetches the rest from the index, so Lucene must close the reader and
rebuild the cache on every commit. That is not my case, as I want to copy
values from the index and store them in memory.)

The weird 'recycling' of docids I've observed can probably be explained
by the fact that the index reader contains segments and near-real-time
readers (but I'm not sure about this).

To conclude: it is possible to build a cache that updates itself with only
the changes committed since the last build. This will affect how fast a
new searcher is ready to serve requests.

HTH somebody else too :)

  roman








Re: building custom cache - using lucene docids

2013-11-24 Thread Erick Erickson
bq: Do I understand you correctly that when two segments get merged, the
docids (of the original segments) remain the same?

The original segments are unchanged, segments are _never_ changed after
they're closed. But they'll be thrown away. Say you have segment1 and
segment2 that get merged into segment3. As soon as the last searcher
that is looking at segment1 and segment2 is closed, those two segments
will be deleted from your disk.

But for any given doc, the docid in segment3 will very likely be different
than it was in segment1 or 2.

I think you're reading too much into LUCENE-2897. I'm pretty sure the
segment in question is not available to you anyway before this rewrite is
done, but freely admit I don't know much about it.

You're probably going to get into the whole PerSegment family of operations,
which is something I'm not all that familiar with, so I'll leave
explanations to others.





Re: building custom cache - using lucene docids

2013-11-24 Thread Jack Krupansky
We should probably talk about internal Lucene document IDs and external 
or rebased Lucene document IDs. The internal document IDs are always 
per-segment and never, ever change for that closed segment. But... the 
application would not normally see these IDs. Usually the externally visible 
Lucene document IDs have been rebased to add the sum total count of 
documents (both existing and deleted) of all preceding segments to the 
document IDs of a given segment, producing a global (across the full index 
of all segments) Lucene document ID.


So, if you have those three segments, with deleted documents in the first 
two segments, and then merge those first two segments, the 
externally-visible Lucene document IDs for the third segment will suddenly 
all be different, shifted lower by the number of deleted documents that were 
just merged away, even though nothing changed in the third segment itself.
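
A small worked example of that shift, as a sketch in plain Java (not Lucene code). The numbers are made up, and it assumes the merged segment simply takes the place of the two segments it replaces.

```java
// Illustration only: plain Java, not Lucene code. Names and numbers are made up.
final class RebaseDemo {

    // Base of each segment = total doc count (live + deleted) of all preceding segments.
    static int[] docBases(int[] maxDocs) {
        int[] bases = new int[maxDocs.length];
        for (int i = 1; i < maxDocs.length; i++) {
            bases[i] = bases[i - 1] + maxDocs[i - 1];
        }
        return bases;
    }

    public static void main(String[] args) {
        int[] maxDocs = {10, 8, 6}; // docs per segment, deleted ones included
        int[] deleted = {3, 2, 0};  // deleted docs per segment

        int before = docBases(maxDocs)[2]; // third segment starts at 10 + 8 = 18

        // Merging the first two segments keeps only their live docs: (10-3) + (8-2) = 13.
        int mergedMaxDoc = (maxDocs[0] - deleted[0]) + (maxDocs[1] - deleted[1]);
        int after = docBases(new int[] {mergedMaxDoc, maxDocs[2]})[1]; // now starts at 13

        System.out.println(before - after); // 5 = number of deleted docs merged away
    }
}
```

Every global docid in the third segment drops by 5, although its segment-local docids are untouched.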


Maybe these should be called "local" (to the segment) Lucene document IDs
and "global" (across all segments) Lucene document IDs. Or maybe "internal"
vs. "external" is good enough.


In short, it is completely safe to use and save Lucene document IDs, but 
only as long as no merging of segments is performed. Even one tiny merge and 
all subsequent saved document IDs are invalidated. Be careful with your 
merge policy - normally merges are happening in the background, 
automatically.


-- Jack Krupansky


Re: building custom cache - using lucene docids

2013-11-24 Thread Mikhail Khludnev
Roman,

I don't fully understand your question. After a segment is flushed it is
never changed, hence segment-local docids always stay the same. Due to a
merge a segment can be gone, and its docs become new ones in another
segment. 'Global' (Solr-style) docnums are affected too: they can shift
after a merge happens in the middle of the segment chain.
Since you are talking about a segmented cache, I suggest looking at
CachingWrapperFilter and NoOpRegenerator as patterns for such data
structures.






-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: building custom cache - using lucene docids

2013-11-23 Thread Erick Erickson
bq: But can I assume
that docids in other segments (other than the last one) will be relatively
stable?

Kinda. Maybe. Maybe not. It depends on how you define other than the
last one.

The key is that the internal doc IDs may change when segments are
merged. And old segments get merged. Doc IDs will _never_ change
in a segment once it's closed (although as you note they may be
marked as deleted). But that segment may be written to a new segment
when merging and the internal ID for a given document in the new
segment bears no relationship to internal ID in the old segment.
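
As a toy illustration of why the IDs change, the sketch below renumbers documents the way a simple merge could: live docs from each source segment are appended in order and deleted docs are skipped. This is plain Java under that stated assumption, not Lucene's actual merge code, and the names are hypothetical.

```java
import java.util.Arrays;

// Illustration only: a simplified model of merging, not Lucene's merge code.
final class MergeRenumberSketch {

    // live[s][d] is true if local docid d of source segment s is not deleted.
    // Returns maps[s][d] = docid in the merged segment, or -1 for deleted docs.
    static int[][] docMaps(boolean[][] live) {
        int[][] maps = new int[live.length][];
        int next = 0; // next free docid in the merged segment
        for (int s = 0; s < live.length; s++) {
            maps[s] = new int[live[s].length];
            for (int d = 0; d < live[s].length; d++) {
                maps[s][d] = live[s][d] ? next++ : -1;
            }
        }
        return maps;
    }

    public static void main(String[] args) {
        boolean[][] live = {
            {true, false, true}, // segment1: doc 1 is deleted
            {false, true}        // segment2: doc 0 is deleted
        };
        System.out.println(Arrays.deepToString(docMaps(live))); // [[0, -1, 1], [-1, 2]]
    }
}
```

Doc 2 of segment1 becomes doc 1, and doc 1 of segment2 becomes doc 2, so any cache keyed by the old IDs has to be remapped or rebuilt for the merged segment.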

BTW, I think you only really care when a new searcher is opened. There is
a UserCache (see solrconfig.xml) that gets notified when a new searcher
is being opened, to give it an opportunity to refresh itself. Is that useful?

As long as a searcher is open, it's guaranteed that nothing is changing.
Hard commits with openSearcher=false don't open new searchers, which
is why changes aren't visible until a softCommit or a hard commit with
openSearcher=true despite the fact that the segments are closed.

FWIW,
Erick

Best
Erick



On Sat, Nov 23, 2013 at 12:40 AM, Roman Chyla roman.ch...@gmail.com wrote:

 Hi,
 docids are 'ephemeral', but I'd still like to build a search cache with
 them (they allow for the fastest joins).

 I'm seeing docids keep changing with updates (especially in the last index
 segment), as per
 https://issues.apache.org/jira/browse/LUCENE-2897

 That would be fine, because I could build the cache from a diff (of index
 state) plus reading the latest index segment in its entirety. But can I
 assume that docids in other segments (other than the last one) will be
 relatively stable? (i.e. when an old doc is deleted, the docid is marked as
 removed; update doc = delete old + create a new docid)?

 thanks

 roman



Re: building custom cache - using lucene docids

2013-11-23 Thread Roman Chyla
Hi Erick,
Many thanks for the info. An additional question:

Do I understand you correctly that when two segments get merged, the docids
(of the original segments) remain the same?

(Unless, perhaps, they were merged together with the last index
segment, which was open for writing and where the docids could have
suddenly changed in a commit just before the merge.)

Yes, you guessed right that I am putting my code into the custom cache, so
it gets notified on index changes. I don't know yet how, but I think I can
find my way to the currently active, open (last) index segment, the one that
is actively updated (as opposed to just being merged). So my definition of
'not last ones' is: segments where docids don't change. I'd be grateful if
someone could spot any problem with this assumption.

roman



