I'm on 4.6.1. I'll file an issue for sure, but is there a workaround you could think of in the meantime? As you probably remember, the reason for doing this in the first place was to prevent the catastrophic heap exhaustion when SegmentReader instances are opened from scratch for every new IndexReader created.
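The only interim idea I've come up with myself is to keep the MRU reader-reuse scheme, but refuse to share SegmentReaders from a reader that is newer than the commit point being opened, and fall back to a full open in that case. Roughly something like the sketch below - untested, with stock org.apache.lucene package names (we actually use the shaded ones), and CommitReaderCache/openReaderFor/readersByGen are just names I made up; eviction and closing of readers are omitted:

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexCommit;

    class CommitReaderCache {
      // Commit-point readers already held open, keyed by commit generation,
      // in access order so the most recently used entry comes last.
      private final Map<Long, DirectoryReader> readersByGen =
          new LinkedHashMap<Long, DirectoryReader>(16, 0.75f, true);

      DirectoryReader openReaderFor(IndexCommit commit) throws IOException {
        DirectoryReader cached = readersByGen.get(commit.getGeneration());
        if (cached != null) {
          return cached;
        }

        // Pick the most recently used reader that is NOT newer than the target
        // commit, so we never reopen "down" from a newer reader to an older one.
        DirectoryReader shareFrom = null;
        for (Map.Entry<Long, DirectoryReader> e : readersByGen.entrySet()) {
          if (e.getKey() <= commit.getGeneration()) {
            shareFrom = e.getValue(); // last match wins = most recently used
          }
        }

        DirectoryReader reader;
        if (shareFrom == null) {
          // No safe reader to share SegmentReaders from; open from scratch.
          reader = DirectoryReader.open(commit);
        } else {
          reader = DirectoryReader.openIfChanged(shareFrom, commit);
          if (reader == null) {
            reader = shareFrom; // already positioned on that commit
          }
        }
        readersByGen.put(commit.getGeneration(), reader);
        return reader;
      }
    }

The obvious downside is that opening an older commit after only newer readers exist gives up SegmentReader sharing entirely, so this only softens the heap problem rather than solving it.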
Come to think of it, I believe this issue can be observed even when the "old" reader is not an NRT reader but a reader for another commit point. But just to check my understanding of Lucene in general - was my hunch correct that deletes can be present as a result of segment merging, and not just actual user-driven index updates?

> On Sep 9, 2014, at 10:46 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>
> Hmm, which Lucene version are you using?  We recently beefed up the
> checking in this code, so you ought to be hitting an exception in
> newer versions.
>
> But that being said, I think the bug is real: if you try to reopen
> from a newer NRT reader down to an older (commit point) reader then
> you can hit this.
>
> Can you open an issue and maybe post a test case showing it?  Thanks.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>> On Tue, Sep 9, 2014 at 2:30 AM, Vitaly Funstein <vfunst...@gmail.com> wrote:
>> I think I see the bug here, but maybe I'm wrong. Here's my theory:
>>
>> Suppose no segments at a particular commit point contain any deletes. Now,
>> we also hold open an NRT reader into the index, which may end up with some
>> deletes after the commit occurred. Then, according to the following
>> conditional in StandardDirectoryReader, we get into the two-arg ctor of
>> SegmentReader:
>>
>>   if (newReaders[i].getSegmentInfo().getDelGen() == infos.info(i).getDelGen()) {
>>     // only DV updates
>>     newReaders[i] = new SegmentReader(infos.info(i), newReaders[i],
>>         newReaders[i].getLiveDocs(), newReaders[i].numDocs());
>>   } else {
>>     // both DV and liveDocs have changed
>>     newReaders[i] = new SegmentReader(infos.info(i), newReaders[i]);
>>   }
>>
>> That constructor looks like this:
>>
>>   SegmentReader(SegmentCommitInfo si, SegmentReader sr) throws IOException {
>>     this(si, sr,
>>         si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si, IOContext.READONCE),
>>         si.info.getDocCount() - si.getDelCount());
>>   }
>>
>> At this point, the SegmentInfo we're trying to read live docs on is from the
>> commit point, and if there weren't any deletes, then the following results
>> in a null for the relevant file name, in Lucene40LiveDocsFormat.readLiveDocs():
>>
>>   String filename = IndexFileNames.fileNameFromGeneration(info.info.name,
>>       DELETES_EXTENSION, info.getDelGen());
>>   final BitVector liveDocs = new BitVector(dir, filename, context);
>>
>> This is where filename ends up being null, which gets passed all the way
>> down to the File constructor.
>>
>> In a nutshell, I think the bug is that the code assumes the segments from the
>> commit point have deletes, when they may not, yet the original
>> SegmentReader for the segment that we are trying to reuse does.
>>
>> What I am not quite clear on yet is how we arrive at this point, because the
>> test that causes the exception doesn't do any deletes/updates... could
>> deletes occur as a result of a segment merge? This might explain the
>> sporadic nature of this exception, since merge timings aren't deterministic.
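(Side note for the JIRA issue: the test case I have in mind is roughly the following - an untested sketch that forces deletes explicitly via deleteDocuments() on the assumption that merge-induced deletes would exercise the same path; RAMDirectory, the field names, and the class name are just placeholders:)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexCommit;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class ReopenToOlderCommitTest {
      public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
            new StandardAnalyzer(Version.LUCENE_46));
        IndexWriter writer = new IndexWriter(dir, iwc);

        Document doc = new Document();
        doc.add(new StringField("id", "1", Field.Store.NO));
        writer.addDocument(doc);
        writer.commit(); // commit point with no deletes

        // NRT reader opened after the commit...
        DirectoryReader nrt = DirectoryReader.open(writer, true);

        // ...which then picks up a delete that is never committed.
        writer.deleteDocuments(new Term("id", "1"));
        DirectoryReader nrtWithDeletes = DirectoryReader.openIfChanged(nrt, writer, true);
        if (nrtWithDeletes == null) {
          nrtWithDeletes = nrt;
        }

        // Reopen "down" from the newer NRT reader to the older commit point;
        // this is where the NPE out of BitVector/readLiveDocs shows up for me.
        IndexCommit oldCommit = DirectoryReader.listCommits(dir).get(0);
        DirectoryReader older = DirectoryReader.openIfChanged(nrtWithDeletes, oldCommit);

        if (older != null) {
          System.out.println("reopened to commit, numDocs=" + older.numDocs());
        }
        // (cleanup omitted)
      }
    }

Whether this minimal sequence triggers it deterministically I don't know yet - in our system it only shows up sporadically, presumably depending on merge timing.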
>>
>> On Mon, Sep 8, 2014 at 11:45 AM, Vitaly Funstein <vfunst...@gmail.com> wrote:
>>
>>> UPDATE:
>>>
>>> After making the changes we discussed to enable sharing of SegmentReaders
>>> between the NRT reader and a commit point reader, specifically calling
>>> through to DirectoryReader.openIfChanged(DirectoryReader, IndexCommit), I
>>> am seeing this exception, sporadically:
>>>
>>> Caused by: java.lang.NullPointerException
>>>         at java.io.File.<init>(File.java:305)
>>>         at org.terracotta.shaded.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
>>>         at org.terracotta.shaded.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:327)
>>>         at org.terracotta.shaded.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:90)
>>>         at org.terracotta.shaded.lucene.index.SegmentReader.<init>(SegmentReader.java:131)
>>>         at org.terracotta.shaded.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:194)
>>>         at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:326)
>>>         at org.terracotta.shaded.lucene.index.StandardDirectoryReader$2.doBody(StandardDirectoryReader.java:320)
>>>         at org.terracotta.shaded.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:702)
>>>         at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromCommit(StandardDirectoryReader.java:315)
>>>         at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:278)
>>>         at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:260)
>>>         at org.terracotta.shaded.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:183)
>>>
>>> Looking at the source quickly, it appears the child argument to the File
>>> ctor is null; is it somehow possible that the segment infos in the commit
>>> point weren't fully written out on the prior commit? Sounds unlikely,
>>> yet disturbing... but nothing else has changed in my code, i.e. the way
>>> commits are performed and indexes are reopened.
>>>
>>> On Fri, Aug 29, 2014 at 2:03 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>>
>>>> On Thu, Aug 28, 2014 at 5:38 PM, Vitaly Funstein <vfunst...@gmail.com> wrote:
>>>>> On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>>>>>
>>>>>> The segments_N file can be different, that's fine: after that, we then
>>>>>> re-use SegmentReaders when they are in common between the two commit
>>>>>> points.  Each segments_N file refers to many segments...
>>>>>
>>>>> Yes, you are totally right - I didn't follow the code far enough the first
>>>>> time around. :) This is an excellent idea, actually - I can probably
>>>>> arrange maintained commit points as an MRU data structure (e.g.
>>>>> LinkedHashMap with access order), and simply grab the most recently opened
>>>>> reader to pass in when obtaining a new one from the new commit point - to
>>>>> maximize SegmentReader reuse.
>>>>
>>>> That's great!
>>>>
>>>>>> You can set it (min and max) as high as you want; the only hard
>>>>>> requirement is that max >= 2*(min-1), I believe.
>>>>>
>>>>> Looks like this is used inside Lucene41PostingsFormat, which simply passes
>>>>> in those defaults - so you are effectively saying the minimum (and
>>>>> therefore, maximum) block size can be raised to reduce the size of the
>>>>> terms index inside those TreeMap nodes?
>>>>
>>>> Yes, but it then increases cost at search time to locate a given term,
>>>> because more scanning is then required once we seek to the block that
>>>> might have the term.
>>>>
>>>> This reduces the size of the FST, but if RAM is being used by
>>>> something else inside BT, it won't help.  But from your screen shot it
>>>> looked like it was almost entirely the FST, which is what I would
>>>> expect.
>>>>
>>>>>>> We are already using a customized codec though, so perhaps adding
>>>>>>> this to the codec is okay and transparent?
>>>>>>
>>>>>> Hmmm :) Customized in what manner?
>>>>>
>>>>> We need to have the ability to turn off stored fields compression, so there
>>>>> is one codec in case the system is configured that way. The other one
>>>>> exists for compression on, but there I tweaked the stored fields format for
>>>>> bias toward decompression, as well as a smaller chunk size - based on some
>>>>> empirical observations from executed tests. I am guessing I'll just add
>>>>> another customization to both that deals with the block sizing for the
>>>>> postings format, and see what difference that makes...
>>>>
>>>> Ahh, OK.  Yes, just add this custom terms index block sizing too.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
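P.S. For what it's worth, the block-sizing customization I'm planning to try looks roughly like this - a sketch only, with a made-up codec name and illustrative sizes (our real codec also carries the stored-fields tweaks mentioned above):

    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
    import org.apache.lucene.codecs.lucene46.Lucene46Codec;

    // Same as the stock 4.6 codec, but with larger terms index blocks to
    // shrink the in-heap FST; the sizes must satisfy max >= 2*(min-1).
    public class LargeTermsBlockCodec extends FilterCodec {

      // Defaults are 25/48; these larger values trade term-lookup speed for heap.
      private final PostingsFormat postings = new Lucene41PostingsFormat(64, 256);

      public LargeTermsBlockCodec() {
        super("LargeTermsBlockCodec", new Lucene46Codec());
      }

      @Override
      public PostingsFormat postingsFormat() {
        return postings;
      }
    }

Plus the usual SPI registration so the codec name resolves when segments are read back, and IndexWriterConfig.setCodec(new LargeTermsBlockCodec()) on the write side.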