Re: building custom cache - using lucene docids
On Sun, Nov 24, 2013 at 8:31 AM, Erick Erickson erickerick...@gmail.com wrote:

> bq: Do i understand you correctly that when two segments get merged, the docids (of the original segments) remain the same?
>
> The original segments are unchanged; segments are _never_ changed after they're closed. But they'll be thrown away. Say you have segment1 and segment2 that get merged into segment3. As soon as the last searcher that is looking at segment1 and segment2 is closed, those two segments will be deleted from your disk. But for any given doc, the docid in segment3 will very likely be different than it was in segment1 or 2.

I'm trying to figure this out; I'll have to dig, I suppose. For example, if the docbase (the docid offset of each segment within a searcher) were stored together with the index segment, that would be an indication of 'relative stability of docids'.

> I think you're reading too much into LUCENE-2897. I'm pretty sure the segment in question is not available to you anyway before this rewrite is done, but freely admit I don't know much about it.

I've done tests, committing and overwriting a document, and saw (SOLR 4.0) that docids are being recycled. I deleted 2 docs, then added a new document, and guess what: the new document had the docid of a previously deleted document (but different fields). That was new to me, so I searched and found LUCENE-2897, which seemed to explain that behaviour.

> You're probably going to get into the whole PerSegment family of operations, which is something I'm not all that familiar with so I'll leave explanations to others.

Thank you, it is useful to get insights from various sides,

roman
Re: building custom cache - using lucene docids
On Sun, Nov 24, 2013 at 10:44 AM, Jack Krupansky j...@basetechnology.com wrote:

> We should probably talk about internal Lucene document IDs and external or rebased Lucene document IDs. The internal document IDs are always per-segment and never, ever change for that closed segment. But... the application would not normally see these IDs. Usually the externally visible Lucene document IDs have been rebased to add the sum total count of documents (both existing and deleted) of all preceding segments to the document IDs of a given segment, producing a global (across the full index of all segments) Lucene document ID.
>
> So, if you have those three segments, with deleted documents in the first two segments, and then merge those first two segments, the externally-visible Lucene document IDs for the third segment will suddenly all be different, shifted lower by the number of deleted documents that were just merged away, even though nothing changed in the third segment itself.

That's right, and I'm starting to think that if I keep the segment id and the original offset, I don't need to rebuild that part of the cache, because it has not been rebased (but I can always update the deleted docs). It seems simple, so I suspect there is a catch somewhere. But if it works, that could potentially speed up any cache building.

Do you have information on where the docbase of a segment is stored? Or which Java class I should start my exploration from? [it is a somewhat sprawling complex, so I'm a bit lost :)]

> Maybe these should be called local (to the segment) Lucene document IDs and global (across all segments) Lucene document IDs. Or, maybe internal vs. external is good enough.
>
> In short, it is completely safe to use and save Lucene document IDs, but only as long as no merging of segments is performed. Even one tiny merge and all subsequent saved document IDs are invalidated. Be careful with your merge policy - normally merges are happening in the background, automatically.

My tests, as per the previous email, showed that the last segment's docids are not that stable. I don't know if it matters that I used the RAMDirectory for the test, but the docids were being 'recycled': the deleted docs were in the previous segment, then suddenly their docids were assigned to newly added documents (so maybe solr/lucene is not counting deleted docs if they are at the end of a segment...?). I don't know. I'll need to explore the index segments to understand what was going on there.

Thanks for any possible pointers,

roman
Re: building custom cache - using lucene docids
On Mon, Nov 25, 2013 at 12:54 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

> Roman,
>
> I don't fully understand your question. After a segment is flushed it never changes, hence segment-local docids are always the same. Due to a merge a segment can go away, and its docs become new ones in another segment. The same is true for 'global' (Solr-style) docnums, which can flip after a merge happens in the middle of the segment chain.
>
> Since you are talking about a segmented cache, I suggest looking at CachingWrapperFilter and NoOpRegenerator as a pattern for such data structures.

Thanks Mikhail, the CWF confirms that the idea of regenerating just part of the cache is doable.

The CacheRegenerators, on the other hand, make no sense to me: they are not given any 'signals', so they don't know whether they are in the middle of some regeneration or not, and they should not keep state (of the previous index), as they can be shared by threads that build the cache.

Best,

roman
Re: building custom cache - using lucene docids
OK, I've spent some time reading the solr/lucene4x classes, and this is my understanding (feel free to correct me ;-))

DirectoryReader holds the opened segments; each segment has its own reader. The BaseCompositeReader (or classes extending it) stores the offset of each segment, e.g. [0, 5, 22], meaning there are 2 segments with 5 and 17 docs respectively.

The segments are listed in the segments_N file, http://lucene.apache.org/core/3_0_3/fileformats.html#Segments File

So theoretically the order of segments could change when a merge happens. Yet every SegmentReader is identified by a unique name, and this name doesn't change unless the segment itself changed (i.e. docs were deleted, or it got more docs), so it is possible to rely on this name to know what has not changed.

The name comes from SegmentInfo (check its toString method). SegmentInfo has an equals() method that treats as equal the readers with the same name and the same dir (which is useful to know: two readers, one with deletes, one without, are equal).

Lucene's FieldCache itself is rather complex, but it shows there is a very clever mechanism (a few, actually!): a class can register a listener that will be called whenever an index segment is being closed (this could be used to invalidate portions of a cache). The relevant classes are SegmentReader.CoreClosedListener and IndexReader.ReaderClosedListener.

But Lucene uses this mechanism only to purge the cache, so effectively every commit triggers a cache rebuild. This is the interesting bit: lots of work could be spared if segment data were reused (admittedly only sometimes, for data that was fully read into memory; for anything else, such as terms, the cache reads only some values and fetches the rest from the index, so Lucene must close the reader and rebuild the cache on every commit. But that is not my case, as I aim to copy values from an index and store them in memory...)

The weird 'recycling' of docids I've observed can probably be explained by the fact that the index reader contains segments and near-realtime readers (but I'm not sure about this).

To conclude: it is possible to build a cache that updates itself (with only the changes committed since the last build). This will have an impact on how fast a new searcher is ready to serve requests.

HTH somebody else too :)

roman
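The two ideas in the summary above (per-segment offsets such as [0, 5, 22], and segment names as keys for reusing cached data) can be sketched in plain Java. This is only a toy model under stated assumptions, not Lucene or Solr API: the class `SegmentCacheSketch`, its methods, and the `Map<String, int[]>` standing in for "values stored in a segment" are all hypothetical, and it assumes that a segment with an unchanged name holds unchanged values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model (hypothetical names, not Lucene API): segment offsets and a cache reused across reopens. */
final class SegmentCacheSketch {

    /** Builds the offsets array from per-segment doc counts, e.g. {5, 17} gives {0, 5, 22}. */
    static int[] starts(int[] maxDocs) {
        int[] s = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            s[i + 1] = s[i] + maxDocs[i];
        }
        return s;
    }

    /** Returns the index of the segment that contains the given global docid. */
    static int segmentOf(int globalDoc, int[] starts) {
        if (globalDoc < 0) {
            throw new IllegalArgumentException("negative docid: " + globalDoc);
        }
        for (int i = 0; i < starts.length - 1; i++) {
            if (globalDoc < starts[i + 1]) {
                return i;
            }
        }
        throw new IllegalArgumentException("docid out of range: " + globalDoc);
    }

    /**
     * Rebuilds the cache for a new searcher. Entries whose segment name is already
     * cached are reused as-is (assumption: same name means same values); only
     * segments with new names are "loaded", which here is just a copy of the array.
     */
    static Map<String, int[]> refresh(Map<String, int[]> oldCache, Map<String, int[]> segmentsNow) {
        Map<String, int[]> next = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> seg : segmentsNow.entrySet()) {
            int[] cached = oldCache.get(seg.getKey());
            next.put(seg.getKey(), cached != null ? cached : seg.getValue().clone());
        }
        return next;
    }

    public static void main(String[] args) {
        int[] s = starts(new int[] {5, 17});
        int segment = segmentOf(7, s);
        // Local docid = global docid minus the offset of the containing segment.
        System.out.println("global 7 -> segment " + segment + ", local " + (7 - s[segment]));
    }
}
```

Keying by name rather than by position in the offsets array is the design choice that matters here: positions and offsets shift when earlier segments are merged, while the cached per-segment arrays stay addressable by local docid.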
Re: building custom cache - using lucene docids
bq: Do i understand you correctly that when two segments get merged, the docids (of the original segments) remain the same?

The original segments are unchanged; segments are _never_ changed after they're closed. But they'll be thrown away. Say you have segment1 and segment2 that get merged into segment3. As soon as the last searcher that is looking at segment1 and segment2 is closed, those two segments will be deleted from your disk. But for any given doc, the docid in segment3 will very likely be different than it was in segment1 or 2.

I think you're reading too much into LUCENE-2897. I'm pretty sure the segment in question is not available to you anyway before this rewrite is done, but I freely admit I don't know much about it.

You're probably going to get into the whole PerSegment family of operations, which is something I'm not all that familiar with, so I'll leave explanations to others.
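To make the segment1 + segment2 -> segment3 case concrete, here is a toy model in plain Java (not the Lucene API). It assumes a merge simply copies the live documents of its inputs, in order, into the new segment; the class name, the null-for-deleted convention, and the sample keys are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (hypothetical, not Lucene API): merging renumbers local docids. */
final class MergeRenumberSketch {

    /**
     * Copies the live docs of each input segment, in order, into one new segment.
     * A list index is the local docid; a null entry marks a deleted slot.
     */
    static List<String> merge(List<List<String>> segments) {
        List<String> merged = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc != null) {
                    merged.add(doc); // deleted slots are dropped, so later docs move down
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> segment1 = Arrays.asList("a", null, "c"); // local docid 1 was deleted
        List<String> segment2 = Arrays.asList("d", "e");
        List<String> segment3 = merge(Arrays.asList(segment1, segment2));

        System.out.println("c: " + segment1.indexOf("c") + " -> " + segment3.indexOf("c"));
        System.out.println("d: " + segment2.indexOf("d") + " -> " + segment3.indexOf("d"));
    }
}
```

In this model "c" moves from local docid 2 to 1 and "d" from 0 to 2, while segment1 and segment2 themselves are never modified, which matches the description above.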
Re: building custom cache - using lucene docids
We should probably talk about internal Lucene document IDs and external or rebased Lucene document IDs. The internal document IDs are always per-segment and never, ever change for that closed segment. But... the application would not normally see these IDs. Usually the externally visible Lucene document IDs have been rebased to add the sum total count of documents (both existing and deleted) of all preceding segments to the document IDs of a given segment, producing a global (across the full index of all segments) Lucene document ID.

So, if you have those three segments, with deleted documents in the first two segments, and then merge those first two segments, the externally-visible Lucene document IDs for the third segment will suddenly all be different, shifted lower by the number of deleted documents that were just merged away, even though nothing changed in the third segment itself.

Maybe these should be called local (to the segment) Lucene document IDs and global (across all segments) Lucene document IDs. Or, maybe internal vs. external is good enough.

In short, it is completely safe to use and save Lucene document IDs, but only as long as no merging of segments is performed. Even one tiny merge and all subsequent saved document IDs are invalidated. Be careful with your merge policy - normally merges are happening in the background, automatically.

-- Jack Krupansky
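The rebasing arithmetic Jack describes can be worked through with small numbers. The sketch below is plain Java with invented segment sizes and hypothetical names (`DocBaseShiftSketch`, `docBase`, `globalId`); it only models "global ID = local ID + total slots in all preceding segments" and is not how any Lucene class is actually written.

```java
/** Toy model (hypothetical names and sizes): how a merge shifts global docids of a later segment. */
final class DocBaseShiftSketch {

    /** Sum of the doc slots (live plus deleted) of all segments before segmentIndex. */
    static int docBase(int[] maxDocs, int segmentIndex) {
        int base = 0;
        for (int i = 0; i < segmentIndex; i++) {
            base += maxDocs[i];
        }
        return base;
    }

    /** Global docid = base of the segment + docid local to that segment. */
    static int globalId(int[] maxDocs, int segmentIndex, int localId) {
        return docBase(maxDocs, segmentIndex) + localId;
    }

    public static void main(String[] args) {
        // Three segments A, B, C with 10, 8 and 6 slots; A has 2 deleted docs, B has 3.
        int[] before = {10, 8, 6};
        // A and B are merged: 18 slots minus 5 deleted = 13 slots. C is untouched.
        int[] after = {13, 6};

        int localInC = 4; // the same document in C, whose local docid never changes
        System.out.println("before merge: " + globalId(before, 2, localInC)); // 18 + 4 = 22
        System.out.println("after merge:  " + globalId(after, 1, localInC));  // 13 + 4 = 17
    }
}
```

The global ID drops by 5, exactly the number of deleted documents merged away, while the local ID within C stays 4. A cache that stores (segment, local ID) pairs would therefore survive this merge for C, whereas one that stores global IDs would not.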
Re: building custom cache - using lucene docids
Roman,

I don't fully understand your question. After a segment is flushed it never changes, hence segment-local docids are always the same. Due to a merge a segment can go away, and its docs become new ones in another segment. The same is true for 'global' (Solr-style) docnums, which can flip after a merge happens in the middle of the segment chain.

Since you are talking about a segmented cache, I suggest looking at CachingWrapperFilter and NoOpRegenerator as a pattern for such data structures.

-- Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
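As a rough illustration of the per-segment caching pattern mentioned above: as far as I recall, CachingWrapperFilter in Lucene 4.x keeps its entries in a synchronized WeakHashMap keyed by a per-segment key object, so an entry disappears once its segment is no longer referenced. The sketch below imitates only that general shape in plain Java; `PerSegmentCacheSketch`, its `get` method, and the use of arbitrary objects as segment keys are assumptions for illustration, not the Lucene implementation.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Toy per-segment cache (hypothetical, not Lucene code): one entry per segment key, weakly held. */
final class PerSegmentCacheSketch<V> {

    // Weak keys: once nothing else references a segment key, its entry can be collected.
    private final Map<Object, V> cache =
            Collections.synchronizedMap(new WeakHashMap<Object, V>());

    /** Returns the cached value for this segment, computing it on first use only. */
    V get(Object segmentKey, Function<Object, V> loader) {
        V value = cache.get(segmentKey);
        if (value == null) {
            // Two threads may both load here; the last put wins, which is harmless
            // as long as the loader is deterministic for a given segment.
            value = loader.apply(segmentKey);
            cache.put(segmentKey, value);
        }
        return value;
    }

    /** Number of segments that currently have a cached value. */
    int size() {
        return cache.size();
    }
}
```

With this shape, opening a new searcher only costs a load for segments whose key has not been seen before; segments shared with the previous searcher hit the existing entries.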
Re: building custom cache - using lucene docids
bq: But can I assume that docids in other segments (other than the last one) will be relatively stable?

Kinda. Maybe. Maybe not. It depends on how you define "other than the last one".

The key is that the internal doc IDs may change when segments are merged. And old segments get merged. Doc IDs will _never_ change in a segment once it's closed (although, as you note, they may be marked as deleted). But that segment may be written to a new segment when merging, and the internal ID for a given document in the new segment bears no relationship to its internal ID in the old segment.

BTW, I think you only really care when opening a new searcher. There is a UserCache (see solrconfig.xml) that gets notified when a new searcher is being opened to give it an opportunity to refresh itself; is that useful?

As long as a searcher is open, it's guaranteed that nothing is changing. Hard commits with openSearcher=false don't open new searchers, which is why changes aren't visible until a softCommit or a hard commit with openSearcher=true, despite the fact that the segments are closed.

FWIW,

Best,
Erick

On Sat, Nov 23, 2013 at 12:40 AM, Roman Chyla roman.ch...@gmail.com wrote:

> Hi,
>
> docids are 'ephemeral', but I'd still like to build a search cache with them (they allow for the fastest joins).
>
> I'm seeing docids keep changing with updates (especially in the last index segment), as per https://issues.apache.org/jira/browse/LUCENE-2897
>
> That would be fine, because I could build the cache from a diff (of index state) + reading the latest index segment in its entirety. But can I assume that docids in other segments (other than the last one) will be relatively stable? (i.e. when an old doc is deleted, the docid is marked as removed; update doc = delete old, create a new docid)?
>
> thanks
>
> roman
Re: building custom cache - using lucene docids
Hi Erick,

Many thanks for the info. An additional question:

Do I understand you correctly that when two segments get merged, the docids (of the original segments) remain the same? (unless, perhaps, they were merged using the last index segment, which was opened for writing and where the docids could have suddenly changed in a commit just before the merge)

Yes, you guessed right that I am putting my code into the custom cache, so it gets notified on index changes. I don't know yet how, but I think I can find my way to the current active, opened (last) index segment, which is actively updated (as opposed to just being merged). So my definition of 'not last ones' is: where docids don't change. I'd be grateful if someone could spot any problem with such an assumption.

roman