Commit=all!!

Sent from my BlackBerry® smartphone

-----Original Message-----
From: Vitaly Funstein <vfunst...@gmail.com>
Date: Thu, 28 Aug 2014 13:18:08
To: <java-user@lucene.apache.org>
Reply-To: java-user@lucene.apache.org
Subject: Re: BlockTreeTermsReader consumes crazy amount of memory
Thanks for the suggestions! I'll file an enhancement request. But I am still
a little skeptical about the approach of "pooling" segment readers from
prior DirectoryReader instances, opened at earlier commit points. It looks
like the up-to-date check for a non-NRT directory reader just compares the
segment infos file names, and since each commit will create a new SI file,
doesn't this make the check moot?

private DirectoryReader doOpenNoWriter(IndexCommit commit) throws IOException {
  if (commit == null) {
    if (isCurrent()) {
      return null;
    }
  } else {
    if (directory != commit.getDirectory()) {
      throw new IOException("the specified commit does not match the specified Directory");
    }
    if (segmentInfos != null
        && commit.getSegmentsFileName().equals(segmentInfos.getSegmentsFileName())) {
      return null;
    }
  }
  return doOpenFromCommit(commit);
}

As for tuning the block size - would you recommend increasing it to
BlockTreeTermsWriter.DEFAULT_MAX_BLOCK_SIZE, or close to it? And if I did
this, would I have readability issues for indices created before this
change? We are already using a customized codec, though, so perhaps adding
this to the codec is okay and transparent?

On Thu, Aug 28, 2014 at 12:49 PM, Michael McCandless
<luc...@mikemccandless.com> wrote:

> Ugh, you're right: this still won't re-use from IW's reader pool. Can you
> open an issue? Somehow we should make this easier.
>
> In the meantime, I guess you can use openIfChanged from your "back in
> time" reader to open another "back in time" reader. This way you have two
> pools... IW's pool for the series of NRT readers, and another pool shared
> by the "back in time" readers... but we should somehow fix this so it's
> one pool.
>
> OK, it looks like it's the FST terms index, and yes, synthetic terms give
> you synthetic results :) However, to reduce the FST RAM here you can just
> increase the block sizes used by the terms index (see
> BlockTreeTermsWriter).
> Larger blocks = smaller terms index (FST), but possibly slower searches,
> especially MultiTermQueries...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Aug 28, 2014 at 2:50 PM, Vitaly Funstein <vfunst...@gmail.com>
> wrote:
>
>> Thanks, Mike - I think the issue is actually the latter, i.e. a
>> SegmentReader on its own can certainly use enough heap to cause
>> problems, which of course would be made that much worse by a failure to
>> pool readers for unchanged segments.
>>
>> But where are you seeing the behavior that would result in reuse of
>> SegmentReaders from the pool inside the index writer? If I'm reading the
>> code right, here's what it calls:
>>
>> protected DirectoryReader doOpenIfChanged(final IndexCommit commit)
>>     throws IOException {
>>   ensureOpen();
>>
>>   // If we were obtained by writer.getReader(), re-ask the
>>   // writer to get a new reader.
>>   if (writer != null) {
>>     return doOpenFromWriter(commit);
>>   } else {
>>     return doOpenNoWriter(commit);
>>   }
>> }
>>
>> private DirectoryReader doOpenFromWriter(IndexCommit commit)
>>     throws IOException {
>>   if (commit != null) {
>>     return doOpenFromCommit(commit);
>>   }
>>   ......
>>
>> There is no attempt made to inspect the segments inside the commit point
>> here, for possible reader pool reuse.
>>
>> So here's a drill-down into the SegmentReader memory footprint. There
>> aren't actually 88 fields here - rather, this number reflects the
>> "shallow" heap size of the BlockTreeTermsReader instance, i.e. the
>> calculated size without following any of the references from it (at
>> depth 0).
>>
>> https://drive.google.com/file/d/0B5eRTXMELFjjVmxLejQzazVPZzA/edit?usp=sharing
>>
>> I suppose totally randomly generated field values are a bit of a
>> contrived use case, since in the real world there will be far less
>> randomness to each, but perhaps this gives us an idea of the worst-case
>> scenario... just guessing though.
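The "two pools" workaround Mike suggests above - reopening the back-in-time reader from the previous back-in-time reader via openIfChanged, rather than opening each commit from scratch - might look roughly like the sketch below against the Lucene 4.6 API. The class, field, and method names here are illustrative, not from the thread, and real code would need more careful ref-counting:

```java
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;

// Sketch: keep one long-lived "back in time" reader chain and reopen it at
// each requested commit point via openIfChanged(), so SegmentReaders for
// segments the two commits have in common are shared instead of reloaded.
public class SnapshotReaderPool {
  private DirectoryReader current; // latest back-in-time reader; acts as the pool

  public synchronized DirectoryReader openAt(IndexCommit commit) throws IOException {
    if (current == null) {
      current = DirectoryReader.open(commit); // first snapshot: full open
    } else {
      // Reuses SegmentReaders shared between 'current' and 'commit';
      // returns null if nothing changed relative to 'current'.
      DirectoryReader next = DirectoryReader.openIfChanged(current, commit);
      if (next != null) {
        current.close(); // releases the pool's ref; in-flight searches hold their own
        current = next;
      }
    }
    current.incRef(); // caller must decRef() when its search finishes
    return current;
  }
}
```

This gives two pools, as the reply notes: the IndexWriter's pool for the NRT reader series, and this chain for the back-in-time readers; they still cannot share SegmentReaders with each other on 4.6.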
>>
>> On Thu, Aug 28, 2014 at 11:28 AM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>>
>>> Can you drill down some more to see what's using those ~46 MB? Is that
>>> the FSTs in the terms index?
>>>
>>> But we need to decouple the "single segment is opened with multiple
>>> SegmentReaders" issue from e.g. "single SegmentReader is using too much
>>> RAM to hold the terms index". E.g. from this screenshot it looks like
>>> there are 88 fields totaling ~46 MB, so ~0.5 MB per indexed field...
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein <vfunst...@gmail.com>
>>> wrote:
>>>
>>>> Here's the link:
>>>>
>>>> https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
>>>>
>>>> I'm indexing, let's say, 11 unique fields per document. Also, the NRT
>>>> reader is opened continually, and "regular" searches use that one. But
>>>> a special kind of feature allows searching a particular point in time
>>>> (they get cleaned out based on some other logic), which requires
>>>> opening a non-NRT reader just to service such search requests - in my
>>>> understanding, no segment readers for this reader can be shared with
>>>> the NRT reader's pool... or am I off here? This seems evident from
>>>> another heap dump fragment that shows a full new set of segment
>>>> readers attached to that "temporary" reader:
>>>>
>>>> https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
>>>>
>>>> On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless
>>>> <luc...@mikemccandless.com> wrote:
>>>>
>>>>> Hmm, the screenshot didn't make it... can you post a link?
>>>>>
>>>>> If you are using an NRT reader, then when a new one is opened, it
>>>>> won't open new SegmentReaders for all segments, just for newly
>>>>> flushed/merged segments since the last reader was opened.
>>>>> So for your N commit points that you have readers open for, they will
>>>>> be sharing SegmentReaders for segments they have in common.
>>>>>
>>>>> How many unique fields are you adding?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein
>>>>> <vfunst...@gmail.com> wrote:
>>>>>
>>>>>> Mike,
>>>>>>
>>>>>> Here's the screenshot; not sure if it will go through as an
>>>>>> attachment though - if not, I'll post it as a link. Please ignore
>>>>>> the altered package names, since Lucene is shaded in as part of our
>>>>>> build process.
>>>>>>
>>>>>> Some more context about the use case. Yes, the terms are pretty much
>>>>>> unique; the schema for the data set is actually borrowed from here:
>>>>>> https://amplab.cs.berkeley.edu/benchmark/#workload - it's the
>>>>>> UserVisits set, with a couple of other fields added by us. The
>>>>>> values for the fields are generated almost randomly, though some
>>>>>> string fields are picked at random from a fixed dictionary.
>>>>>>
>>>>>> Also, this type of heap footprint might be tolerable if it stayed
>>>>>> relatively constant throughout the system's life cycle (given, of
>>>>>> course, that the index set stays more or less static). However, what
>>>>>> happens here is that one IndexReader reference is maintained by
>>>>>> ReaderManager as an NRT reader. But we would also like to support
>>>>>> the ability to execute searches against specific index commit
>>>>>> points, ideally in parallel. As you might imagine, as soon as a new
>>>>>> DirectoryReader is opened at a given commit, a whole new set of
>>>>>> SegmentReader instances is created and populated, effectively
>>>>>> doubling the already large heap usage...
>>>>>> if there was a way to somehow reuse readers for unchanged segments
>>>>>> already pooled by the IndexWriter, that would help tremendously
>>>>>> here. But I don't think there's a way to link up the two sets, at
>>>>>> least not in the Lucene version we are using (4.6.1) - is this
>>>>>> correct?
>>>>>>
>>>>>> On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
>>>>>> <luc...@mikemccandless.com> wrote:
>>>>>>
>>>>>>> This is surprising: unless you have an excessive number of unique
>>>>>>> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
>>>>>>>
>>>>>>> But you only have 12 unique fields?
>>>>>>>
>>>>>>> Can you post screenshots of the heap usage?
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>>>>>>> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein
>>>>>>> <vfunst...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This is a follow-up to the earlier thread I started to understand
>>>>>>>> the memory usage patterns of SegmentReader instances, but I
>>>>>>>> decided to create a separate post, since this issue is much more
>>>>>>>> serious than the heap overhead created by the use of stored field
>>>>>>>> compression.
>>>>>>>>
>>>>>>>> Here is the use case, once again. The index totals around 300M
>>>>>>>> documents, with 7 string, 2 long, 1 integer, 1 date and 1 float
>>>>>>>> fields, which are both indexed and stored. It is split into 4
>>>>>>>> shards, which are basically separate indices... if that matters.
>>>>>>>> After the index is populated (but not optimized, since we don't do
>>>>>>>> that), the overall heap usage taken up by Lucene is over 1 GB,
>>>>>>>> much of which is taken up by instances of BlockTreeTermsReader.
>>>>>>>> For instance, for the largest segment in one such index, the
>>>>>>>> retained heap size of the internal tree map is around 50 MB. This
>>>>>>>> is evident from heap dump analysis, which I have screenshots of
>>>>>>>> that I can post here, if that helps. As there are many segments of
>>>>>>>> various sizes in the index, as expected, the total heap usage for
>>>>>>>> one shard stands at around 280 MB.
>>>>>>>>
>>>>>>>> Could someone shed some light on whether this is expected, and if
>>>>>>>> so - how could I possibly trim down memory usage here? Is there a
>>>>>>>> way to switch to a different terms index implementation, one that
>>>>>>>> doesn't preload all the terms into RAM, or only does this
>>>>>>>> partially, i.e. as a cache? I'm not sure if I'm framing my
>>>>>>>> questions correctly, as I'm obviously not an expert on Lucene's
>>>>>>>> internals, but this is going to become a critical issue for
>>>>>>>> large-scale use cases of our system.
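The block-size tuning discussed upthread - writing larger BlockTree blocks through a custom codec - can be sketched roughly as below for Lucene 4.6. The codec name and the 50/100 values are illustrative assumptions, not recommendations from the thread (BlockTreeTermsWriter's defaults are 25/48, and it requires maxItemsInBlock >= 2*(minItemsInBlock-1)):

```java
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;

// Sketch: a codec that writes larger terms-index blocks, shrinking the
// in-heap FST at some cost to search speed, especially MultiTermQueries.
public final class LargeBlockCodec extends FilterCodec {
  static final int MIN_ITEMS_IN_BLOCK = 50;   // default is 25
  static final int MAX_ITEMS_IN_BLOCK = 100;  // default is 48; must be >= 2*(min-1)

  private final PostingsFormat postings =
      new Lucene41PostingsFormat(MIN_ITEMS_IN_BLOCK, MAX_ITEMS_IN_BLOCK);

  public LargeBlockCodec() {
    super("LargeBlockCodec", new Lucene46Codec());
  }

  @Override
  public PostingsFormat postingsFormat() {
    return postings;
  }
}
```

The codec name would need to be registered via Java SPI (a META-INF/services entry for org.apache.lucene.codecs.Codec) so segments written with it can be read back; since each segment records the codec that wrote it, segments created before the change remain readable alongside new ones as long as their codec stays registered.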
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
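For reference, the point-in-time search pattern this thread revolves around - keeping old commits alive and opening a non-NRT reader at one of them - can be sketched as follows on the 4.6 API. The deletion-policy remark is an assumption (the thread only says commits are "cleaned out based on some other logic"), and the helper names are illustrative:

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

// Sketch: enumerate the commit points still present in the Directory and
// open a reader on a chosen one. This only works if the IndexWriter was
// configured with an IndexDeletionPolicy that keeps old commits around;
// the default policy deletes all but the latest commit.
public class PointInTimeSearch {
  public static IndexSearcher searcherAt(Directory dir, long targetGeneration)
      throws IOException {
    List<IndexCommit> commits = DirectoryReader.listCommits(dir);
    for (IndexCommit commit : commits) {
      if (commit.getGeneration() == targetGeneration) {
        // As discussed in the thread, this opens a full new set of
        // SegmentReaders, none of which are shared with the NRT pool.
        return new IndexSearcher(DirectoryReader.open(commit));
      }
    }
    throw new IllegalArgumentException("no commit with generation " + targetGeneration);
  }
}
```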