Yeah, it sounds like there might be a leak there. Having the huge tlog might
have just magnified its importance.

Joel Bernstein
Search Engineer at Heliosearch


On Mon, Dec 23, 2013 at 5:17 PM, Greg Preston <gpres...@marinsoftware.com> wrote:

> Interesting.  In my original post, the memory growth (during restart)
> occurs after the tlog is done replaying, but during the merge.
>
> -Greg
>
>
> On Mon, Dec 23, 2013 at 2:06 PM, Joel Bernstein <joels...@gmail.com> wrote:
> > Greg,
> >
> > There is a memory component to the tlog, which supports realtime gets.
> > This memory component grows until there is a commit, so it will appear
> > like a leak. I suspect that replaying a tlog that was big enough to
> > possibly cause OOM is also problematic.
> >
> > One thing you might want to try is going to 15-second commits and then
> > killing the Solr instance between commits. Then watch the memory as the
> > replay occurs with the smaller tlog.
> >
> > Joel
> >
> >
> >
> >
> > Joel Bernstein
> > Search Engineer at Heliosearch
> >
> >
> > On Mon, Dec 23, 2013 at 4:17 PM, Greg Preston <gpres...@marinsoftware.com> wrote:
> >
> >> Hi Joel,
> >>
> >> Thanks for the suggestion.  I could see how decreasing autoCommit time
> >> would reduce tlog size, and how that could possibly be related to the
> >> original OOM error.  I'm not seeing how that would make any difference
> >> once a tlog exists, though?
> >>
> >> I have a saved off copy of my data dir that has the 13G index and 2.5G
> >> tlog.  So I can reproduce the replay -> merge -> memory usage issue
> >> very quickly.  Changing the autoCommit to possibly avoid the initial
> >> OOM will take a good bit longer to try to reproduce.  I may try that
> >> later in the week.
> >>
> >> -Greg
> >>
> >>
> >> On Mon, Dec 23, 2013 at 12:20 PM, Joel Bernstein <joels...@gmail.com>
> >> wrote:
> >> > Hi Greg,
> >> >
> >> > I have a suspicion that the problem might be related to, or exacerbated
> >> > by, overly large tlogs. Can you adjust your autoCommit to 15 seconds?
> >> > Leave openSearcher = false. I would remove the maxDocs trigger as well.
> >> > If you try rerunning under those commit settings, it's possible the OOM
> >> > errors will stop occurring.
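> >> >
> >> > Roughly, that change would look something like this in solrconfig.xml
> >> > (just a sketch -- adjust the interval to taste):
> >> >
> >> >   <autoCommit>
> >> >     <maxTime>15000</maxTime>
> >> >     <openSearcher>false</openSearcher>
> >> >   </autoCommit>
> >> >
> >> > A hard commit on that schedule rolls over the tlog without opening a new
> >> > searcher, so the tlog (and the in-memory map backing realtime get) should
> >> > stay small.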
> >> >
> >> > Joel
> >> >
> >> > Joel Bernstein
> >> > Search Engineer at Heliosearch
> >> >
> >> >
> >> > On Mon, Dec 23, 2013 at 3:07 PM, Greg Preston <gpres...@marinsoftware.com> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> I'm loading up our solr cloud with data (from a solrj client) and
> >> >> running into a weird memory issue.  I can reliably reproduce the
> >> >> problem.
> >> >>
> >> >> - Using Solr Cloud 4.4.0 (also reproduced with 4.6.0)
> >> >> - 24 Solr nodes (one shard each), spread across 3 physical hosts; each
> >> >> host has 256G of memory
> >> >> - index and tlogs on ssd
> >> >> - Xmx=7G, G1GC
> >> >> - Java 1.7.0_25
> >> >> - schema and solrconfig.xml attached
> >> >>
> >> >> I'm using composite routing to route documents with the same clientId
> >> >> to the same shard.  After several hours of indexing, I occasionally
> >> >> see an IndexWriter go OOM.  I think that's a symptom.  When that
> >> >> happens, indexing continues, and that node's tlog starts to grow.
> >> >> When I notice this, I stop indexing, and bounce the problem node.
> >> >> That's where it gets interesting.
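> >> >>
> >> >> For reference, the composite routing above is just the compositeId
> >> >> scheme: the SolrJ client prefixes every document id with the clientId,
> >> >> so all of a client's docs land on the same shard.  A rough sketch (the
> >> >> zk hosts, collection name, and field/variable names are placeholders,
> >> >> not our real ones):
> >> >>
> >> >>     import org.apache.solr.client.solrj.impl.CloudSolrServer;
> >> >>     import org.apache.solr.common.SolrInputDocument;
> >> >>
> >> >>     CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
> >> >>     server.setDefaultCollection("collection1");
> >> >>
> >> >>     SolrInputDocument doc = new SolrInputDocument();
> >> >>     // "clientId!docId": the compositeId router hashes the clientId prefix,
> >> >>     // so every doc with the same clientId maps to the same shard
> >> >>     doc.addField("id", clientId + "!" + docId);
> >> >>     doc.addField("dim_example_s", "some value");  // one of the dim_* dynamic fields
> >> >>     server.add(doc);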
> >> >>
> >> >> Upon bouncing, the tlog replays, and then segments merge.  Once the
> >> >> merging is complete, the heap is fairly full, and forced full GC only
> >> >> helps a little.  But if I then bounce the node again, the heap usage
> >> >> goes way down, and stays low until the next segment merge.  I believe
> >> >> segment merges are also what causes the original OOM.
> >> >>
> >> >> More details:
> >> >>
> >> >> Index on disk for this node is ~13G, tlog is ~2.5G.
> >> >> See attached mem1.png.  This is a jconsole view of the heap during the
> >> >> following:
> >> >>
> >> >> (Solr cloud node started at the left edge of this graph)
> >> >>
> >> >> A) One CPU core pegged at 100%.  Thread dump shows:
> >> >> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> >> >> nid=0x7a74 runnable [0x00007f5a41c5f000]
> >> >>    java.lang.Thread.State: RUNNABLE
> >> >>         at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
> >> >>         at
> >> >> org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
> >> >>         at
> >> >> org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
> >> >>         at
> >> >>
> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
> >> >>         at
> >> >> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
> >> >>         at
> >> >>
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
> >> >>         at
> >> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> >> >>
> >> >> B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
> >> >> memory freed.  Thread dump shows:
> >> >> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> >> >> nid=0x7a74 runnable [0x00007f5a41c5f000]
> >> >>    java.lang.Thread.State: RUNNABLE
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
> >> >>         at
> >> >>
> org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
> >> >>         at
> >> >> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
> >> >>         at
> >> >>
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
> >> >>         at
> >> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> >> >>
> >> >> C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
> >> >> freed.  Thread dump shows:
> >> >> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> >> >> nid=0x7a74 runnable [0x00007f5a41c5f000]
> >> >>    java.lang.Thread.State: RUNNABLE
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
> >> >>         at
> >> >>
> org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
> >> >>         at
> >> >> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
> >> >>         at
> >> >>
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
> >> >>         at
> >> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> >> >>
> >> >> D) One CPU core pegged at 100%.  Thread dump shows:
> >> >> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> >> >> nid=0x7a74 runnable [0x00007f5a41c5f000]
> >> >>    java.lang.Thread.State: RUNNABLE
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:322)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:169)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.merge(CompressingTermVectorsWriter.java:789)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:312)
> >> >>         at
> >> >> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:130)
> >> >>         at
> >> >>
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
> >> >>         at
> >> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> >> >>         at
> >> >>
> >>
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> >> >>
> >> >> E) CPU usage drops to nominal levels.  Thread dump shows no Lucene
> >> >> Merge Thread.
> >> >>
> >> >> F) Manually triggered full GC.  Some memory freed, but much remains.
> >> >>
> >> >> G) Restarted solr.  Very little memory used.
> >> >>
> >> >>
> >> >> Throughout all of this, there was no indexing or querying happening.
> >> >> In order to try to determine what's using up the memory, I took a heap
> >> >> dump at point (F) and analyzed it in Eclipse MAT (see attached
> >> >> screenshot).  This shows 311 instances of Lucene42DocValuesProducer$3,
> >> >> each holding a large byte[].  By attaching a remote debugger and
> >> >> re-running, it looks like there is one of these byte[] for each field in
> >> >> the schema (we have several of the "dim_*" dynamic fields).  And as far
> >> >> as I know, I'm not using DocValues at all.
> >> >>
> >> >>
> >> >> Any hints as to what might be going on here would be greatly
> >> >> appreciated.  It takes me about 10 minutes to reproduce this, so I'm
> >> >> willing to try things.  I don't know enough about the internals of
> >> >> solr's memory usage to proceed much further on my own.
> >> >>
> >> >> Thank you.
> >> >>
> >> >> -Greg
> >> >>
> >>
>
