(Commit=all!!
Sent from my BlackBerry® smartphone

-----Original Message-----
From: Vitaly Funstein <vfunst...@gmail.com>
Date: Thu, 28 Aug 2014 13:18:08 
To: <java-user@lucene.apache.org>
Reply-To: java-user@lucene.apache.org
Subject: Re: BlockTreeTermsReader consumes crazy amount of memory

Thanks for the suggestions! I'll file an enhancement request.

But I am still a little skeptical about the approach of "pooling" segment
readers from prior DirectoryReader instances opened at earlier commit
points. The up-to-date check for a non-NRT directory reader just compares
the segments file names, and since each commit creates a new segments_N
file, doesn't that make the check moot?

  private DirectoryReader doOpenNoWriter(IndexCommit commit) throws IOException {

    if (commit == null) {
      if (isCurrent()) {
        return null;
      }
    } else {
      if (directory != commit.getDirectory()) {
        throw new IOException("the specified commit does not match the specified Directory");
      }
      if (segmentInfos != null
          && commit.getSegmentsFileName().equals(segmentInfos.getSegmentsFileName())) {
        return null;
      }
    }

    return doOpenFromCommit(commit);
  }

As for tuning the block size - would you recommend increasing it to
BlockTreeTermsWriter.DEFAULT_MAX_BLOCK_SIZE, or close to it? And if I did
this, would I have readability issues with indices created before the
change? We are already using a customized codec, so perhaps baking this
into the codec would be okay and transparent?
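
For reference, here's roughly what I have in mind - a minimal sketch of
wiring a larger block size into a custom codec, assuming the 4.6 API (the
class name and the 64/128 values are illustrative only, not a
recommendation):

  import org.apache.lucene.codecs.FilterCodec;
  import org.apache.lucene.codecs.PostingsFormat;
  import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
  import org.apache.lucene.codecs.lucene46.Lucene46Codec;

  // Delegates everything to the default Lucene46Codec except the postings
  // format, which writes larger terms-index blocks (smaller FST, but
  // possibly slower searches, especially MultiTermQueries).
  public final class LargeBlockCodec extends FilterCodec {
    private final PostingsFormat postings =
        new Lucene41PostingsFormat(64, 128); // min/max term block size

    public LargeBlockCodec() {
      super("LargeBlockCodec", new Lucene46Codec());
    }

    @Override
    public PostingsFormat postingsFormat() {
      return postings;
    }
  }

My understanding is that the codec name is recorded per segment, so as long
as the old codec stays registered via SPI, segments written before the
change should remain readable - is that right?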


On Thu, Aug 28, 2014 at 12:49 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> Ugh, you're right: this still won't re-use from IW's reader pool.  Can
> you open an issue?  Somehow we should make this easier.
>
> In the meantime, I guess you can use openIfChanged from your "back in
> time" reader to open another "back in time" reader.  This way you have
> two pools... IW's pool for the series of NRT readers, and another pool
> shared by the "back in time" readers ... but we should somehow fix
> this so it's one pool.
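>
> A minimal sketch of that chaining, assuming the 4.6 API (the reader and
> commit names are illustrative):
>
>   // commit / otherCommit are IndexCommit instances for two snapshots.
>   // Open the first "back in time" reader at some commit point.
>   DirectoryReader backInTime = DirectoryReader.open(commit);
>
>   // For a later snapshot, reopen from the previous back-in-time reader;
>   // segments shared between the two commits keep their SegmentReaders.
>   DirectoryReader next = DirectoryReader.openIfChanged(backInTime, otherCommit);
>   if (next != null) {
>     backInTime.close();
>     backInTime = next;
>   }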
>
> OK, looks like it's the FST terms index, and yes, synthetic terms give
> you synthetic results :)  However, to reduce the FST RAM here you can
> just increase the block sizes used by the terms index (see
> BlockTreeTermsWriter).  Larger blocks = smaller terms index (FST) but
> possibly slower searches, especially MultiTermQueries ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Aug 28, 2014 at 2:50 PM, Vitaly Funstein <vfunst...@gmail.com>
> wrote:
> > Thanks, Mike - I think the issue is actually the latter, i.e. a
> > SegmentReader on its own can certainly use enough heap to cause
> > problems, which of course would be made that much worse by the failure
> > to pool readers for unchanged segments.
> >
> > But where are you seeing the behavior that would result in reuse of
> > SegmentReaders from the pool inside the index writer? If I'm reading
> > the code right, here's what it calls:
> >
> >   protected DirectoryReader doOpenIfChanged(final IndexCommit commit)
> >       throws IOException {
> >     ensureOpen();
> >
> >     // If we were obtained by writer.getReader(), re-ask the
> >     // writer to get a new reader.
> >     if (writer != null) {
> >       return doOpenFromWriter(commit);
> >     } else {
> >       return doOpenNoWriter(commit);
> >     }
> >   }
> >
> >   private DirectoryReader doOpenFromWriter(IndexCommit commit)
> >       throws IOException {
> >     if (commit != null) {
> >       return doOpenFromCommit(commit);
> >     }
> >     ...
> >
> > There is no attempt made here to inspect the segments inside the commit
> > point for possible reader-pool reuse.
> >
> > So here's a drill-down into the SegmentReader memory footprint. There
> > aren't actually 88 fields here - rather, that number reflects the
> > "shallow" heap size of the BlockTreeTermsReader instance, i.e. the size
> > calculated without following any of the references from it (at depth 0).
> >
> > https://drive.google.com/file/d/0B5eRTXMELFjjVmxLejQzazVPZzA/edit?usp=sharing
> >
> > I suppose totally randomly generated field values are a bit of a
> > contrived use case, since in the real world there will be far less
> > randomness to each, but perhaps this gives us an idea of the worst-case
> > scenario... just guessing though.
> >
> >
> >
> > On Thu, Aug 28, 2014 at 11:28 AM, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >
> >> Can you drill down some more to see what's using those ~46 MB?  Is it
> >> the FSTs in the terms index?
> >>
> >> But we need to decouple the "single segment is opened with multiple
> >> SegmentReaders" issue from e.g. "a single SegmentReader is using too
> >> much RAM to hold the terms index".  E.g. from this screenshot it looks
> >> like there are 88 fields totaling ~46 MB, so ~0.5 MB per indexed field ...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein <vfunst...@gmail.com>
> >> wrote:
> >> > Here's the link:
> >> >
> >> > https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
> >> >
> >> > I'm indexing, let's say, 11 unique fields per document. Also, the
> >> > NRT reader is reopened continually, and "regular" searches use that
> >> > one. But a special feature allows searching a particular point in
> >> > time (the commit points get cleaned out based on some other logic),
> >> > which requires opening a non-NRT reader just to service such search
> >> > requests - in my understanding, no segment readers for this reader
> >> > can be shared with the NRT reader's pool... or am I off here? This
> >> > seems evident from another heap dump fragment that shows a full new
> >> > set of segment readers attached to that "temporary" reader:
> >> >
> >> > https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
> >> >
> >> >
> >> > On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless
> >> > <luc...@mikemccandless.com> wrote:
> >> >
> >> >> Hmm screen shot didn't make it ... can you post link?
> >> >>
> >> >> If you are using an NRT reader, then when a new one is opened, it
> >> >> won't open new SegmentReaders for all segments, just for segments
> >> >> newly flushed/merged since the last reader was opened.  So the N
> >> >> commit points that you have readers open for will share
> >> >> SegmentReaders for the segments they have in common.
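> >> >>
> >> >> A minimal sketch of that reopen pattern, assuming the 4.6 API
> >> >> (variable names are illustrative):
> >> >>
> >> >>   // First NRT reader; its SegmentReaders are pooled by the writer.
> >> >>   DirectoryReader nrt = DirectoryReader.open(writer, true);
> >> >>
> >> >>   // ... index more docs, then reopen: only newly flushed/merged
> >> >>   // segments get new SegmentReaders.
> >> >>   DirectoryReader newer = DirectoryReader.openIfChanged(nrt, writer, true);
> >> >>   if (newer != null) {
> >> >>     nrt.close();
> >> >>     nrt = newer;
> >> >>   }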
> >> >>
> >> >> How many unique fields are you adding?
> >> >>
> >> >> Mike McCandless
> >> >>
> >> >> http://blog.mikemccandless.com
> >> >>
> >> >>
> >> >> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein <vfunst...@gmail.com>
> >> >> wrote:
> >> >> > Mike,
> >> >> >
> >> >> > Here's the screenshot; not sure if it will go through as an
> >> >> > attachment though - if not, I'll post it as a link. Please ignore
> >> >> > the altered package names, since Lucene is shaded in as part of
> >> >> > our build process.
> >> >> >
> >> >> > Some more context about the use case. Yes, the terms are pretty
> >> >> > much unique; the schema for the data set is actually borrowed from
> >> >> > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the
> >> >> > UserVisits set, with a couple of other fields added by us. The
> >> >> > values for the fields are generated almost randomly, though some
> >> >> > string fields are picked at random from a fixed dictionary.
> >> >> >
> >> >> > Also, this type of heap footprint might be tolerable if it stayed
> >> >> > relatively constant throughout the system's life cycle (given, of
> >> >> > course, that the index set stays more or less static). However,
> >> >> > what happens here is that one IndexReader reference is maintained
> >> >> > by ReaderManager as an NRT reader. But we would also like to
> >> >> > support the ability to execute searches against specific index
> >> >> > commit points, ideally in parallel. As you might imagine, as soon
> >> >> > as a new DirectoryReader is opened at a given commit, a whole new
> >> >> > set of SegmentReader instances is created and populated,
> >> >> > effectively doubling the already large heap usage... if there were
> >> >> > a way to somehow reuse readers for unchanged segments already
> >> >> > pooled by the IndexWriter, that would help tremendously here. But
> >> >> > I don't think there's a way to link up the two sets, at least not
> >> >> > in the Lucene version we are using (4.6.1) - is this correct?
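> >> >> >
> >> >> > For context, a minimal sketch of how we open the point-in-time
> >> >> > readers, assuming the 4.6 API (variable names are illustrative):
> >> >> >
> >> >> >   // List the commits still on disk and open a reader at one of
> >> >> >   // them. Each such open builds a brand-new set of SegmentReaders;
> >> >> >   // nothing is reused from the IndexWriter's reader pool.
> >> >> >   List<IndexCommit> commits = DirectoryReader.listCommits(directory);
> >> >> >   IndexCommit snapshot = commits.get(commits.size() - 1);
> >> >> >   DirectoryReader pointInTime = DirectoryReader.open(snapshot);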
> >> >> >
> >> >> >
> >> >> > On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
> >> >> > <luc...@mikemccandless.com> wrote:
> >> >> >>
> >> >> >> This is surprising: unless you have an excessive number of unique
> >> >> >> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
> >> >> >>
> >> >> >> But you only have 12 unique fields?
> >> >> >>
> >> >> >> Can you post screenshots of the heap usage?
> >> >> >>
> >> >> >> Mike McCandless
> >> >> >>
> >> >> >> http://blog.mikemccandless.com
> >> >> >>
> >> >> >>
> >> >> >> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein
> >> >> >> <vfunst...@gmail.com> wrote:
> >> >> >> > This is a follow-up to the earlier thread I started to
> >> >> >> > understand the memory usage patterns of SegmentReader instances,
> >> >> >> > but I decided to create a separate post since this issue is much
> >> >> >> > more serious than the heap overhead created by the use of stored
> >> >> >> > field compression.
> >> >> >> >
> >> >> >> > Here is the use case, once again. The index totals around 300M
> >> >> >> > documents, with 7 string, 2 long, 1 integer, 1 date and 1 float
> >> >> >> > fields, all of which are both indexed and stored. It is split
> >> >> >> > into 4 shards, which are basically separate indices... if that
> >> >> >> > matters. After the index is populated (but not optimized, since
> >> >> >> > we don't do that), the overall heap usage by Lucene is over
> >> >> >> > 1 GB, much of it held by instances of BlockTreeTermsReader. For
> >> >> >> > the largest segment in one such index, for instance, the
> >> >> >> > retained heap size of the internal tree map is around 50 MB.
> >> >> >> > This is evident from heap dump analysis; I have screenshots of
> >> >> >> > it that I can post here if that helps. As there are many
> >> >> >> > segments of various sizes in the index, as expected, the total
> >> >> >> > heap usage for one shard stands at around 280 MB.
> >> >> >> >
> >> >> >> > Could someone shed some light on whether this is expected, and
> >> >> >> > if so, how I could possibly trim down memory usage here? Is
> >> >> >> > there a way to switch to a different terms index implementation,
> >> >> >> > one that doesn't preload all the terms into RAM, or only does so
> >> >> >> > partially, i.e. as a cache? I'm not sure if I'm framing my
> >> >> >> > questions correctly, as I'm obviously not an expert on Lucene's
> >> >> >> > internals, but this is going to become a critical issue for
> >> >> >> > large-scale use cases of our system.