Greg,

There is an in-memory component to the tlog, which supports realtime gets. This
memory component grows until there is a commit, so it can look like a
leak. I suspect that replaying a tlog big enough to have possibly caused the
OOM is also problematic.

One thing you might want to try is switching to 15-second autoCommits, and then
killing the Solr instance between commits. Then watch the memory as the
replay occurs with the smaller tlog.
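
For reference, something along these lines in solrconfig.xml is what I have
in mind (a minimal sketch; the 15-second value is just a starting point, and
you'd fold it into your existing updateHandler section):

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <!-- Hard commit every 15 seconds so the tlog (and its in-memory map)
       stays small. openSearcher=false keeps these commits from opening
       a new searcher; visibility can still come from soft commits. -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

With commits that frequent, killing the instance between commits should leave
only a small tlog to replay on startup.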

Joel




Joel Bernstein
Search Engineer at Heliosearch


On Mon, Dec 23, 2013 at 4:17 PM, Greg Preston <gpres...@marinsoftware.com> wrote:

> Hi Joel,
>
> Thanks for the suggestion.  I could see how decreasing autoCommit time
> would reduce tlog size, and how that could possibly be related to the
> original OOM error.  I'm not seeing how that would make any difference
> once a tlog exists, though?
>
> I have a saved off copy of my data dir that has the 13G index and 2.5G
> tlog.  So I can reproduce the replay -> merge -> memory usage issue
> very quickly.  Changing the autoCommit to possibly avoid the initial
> OOM will take a good bit longer to try to reproduce.  I may try that
> later in the week.
>
> -Greg
>
>
> On Mon, Dec 23, 2013 at 12:20 PM, Joel Bernstein <joels...@gmail.com> wrote:
> > Hi Greg,
> >
> > I have a suspicion that the problem might be related to, or exacerbated
> > by, overly large tlogs. Can you adjust your autoCommits to 15 seconds?
> > Leave openSearcher = false. I would remove the maxDocs setting as well.
> > If you try rerunning under those commit settings, it's possible the OOM
> > errors will stop occurring.
> >
> > Joel
> >
> > Joel Bernstein
> > Search Engineer at Heliosearch
> >
> >
> > On Mon, Dec 23, 2013 at 3:07 PM, Greg Preston <gpres...@marinsoftware.com> wrote:
> >
> >> Hello,
> >>
> >> I'm loading up our solr cloud with data (from a solrj client) and
> >> running into a weird memory issue.  I can reliably reproduce the
> >> problem.
> >>
> >> - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
> >> - 24 solr nodes (one shard each), spread across 3 physical hosts, each
> >> host has 256G of memory
> >> - index and tlogs on ssd
> >> - Xmx=7G, G1GC
> >> - Java 1.7.0_25
> >> - schema and solrconfig.xml attached
> >>
> >> I'm using composite routing to route documents with the same clientId
> >> to the same shard.  After several hours of indexing, I occasionally
> >> see an IndexWriter go OOM.  I think that's a symptom.  When that
> >> happens, indexing continues, and that node's tlog starts to grow.
> >> When I notice this, I stop indexing, and bounce the problem node.
> >> That's where it gets interesting.
> >>
> >> Upon bouncing, the tlog replays, and then segments merge.  Once the
> >> merging is complete, the heap is fairly full, and forced full GC only
> >> helps a little.  But if I then bounce the node again, the heap usage
> >> goes way down, and stays low until the next segment merge.  I believe
> >> segment merges are also what causes the original OOM.
> >>
> >> More details:
> >>
> >> Index on disk for this node is ~13G, tlog is ~2.5G.
> >> See attached mem1.png.  This is a jconsole view of the heap during the
> >> following:
> >>
> >> (Solr cloud node started at the left edge of this graph)
> >>
> >> A) One CPU core pegged at 100%.  Thread dump shows:
> >> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
> >>    java.lang.Thread.State: RUNNABLE
> >>         at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
> >>         at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
> >>         at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
> >>         at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
> >>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
> >>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
> >>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
> >>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
> >>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> >>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> >>
> >> B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
> >> memory freed.  Thread dump shows:
> >> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
> >>    java.lang.Thread.State: RUNNABLE
> >>         at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
> >>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
> >>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
> >>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
> >>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
> >>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
> >>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
> >>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
> >>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> >>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> >>
> >> C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
> >> freed.  Thread dump shows:
> >> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
> >>    java.lang.Thread.State: RUNNABLE
> >>         at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
> >>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
> >>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
> >>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
> >>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
> >>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
> >>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
> >>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
> >>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> >>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> >>
> >> D) One CPU core pegged at 100%.  Thread dump shows:
> >> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
> >>    java.lang.Thread.State: RUNNABLE
> >>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:322)
> >>         at org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:169)
> >>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.merge(CompressingTermVectorsWriter.java:789)
> >>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:312)
> >>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:130)
> >>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
> >>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
> >>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> >>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> >>
> >> E) CPU usage drops to nominal levels.  Thread dump shows no Lucene Merge
> >> Thread.
> >>
> >> F) Manually triggered full GC.  Some memory freed, but much remains.
> >>
> >> G) Restarted solr.  Very little memory used.
> >>
> >>
> >> Throughout all of this, there was no indexing or querying happening.
> >> In order to try to determine what's using up the memory, I took a
> >> heap dump at point (F) and analyzed it in Eclipse MAT (see attached
> >> screenshot).  This shows 311 instances of Lucene42DocValuesProducer$3,
> >> each holding a large byte[].  By attaching a remote debugger and
> >> re-running, it looks like there is one of these byte[] for each field
> >> in the schema (we have several of the "dim_*" dynamic fields).  And as
> >> far as I know, I'm not using DocValues at all.
> >>
> >>
> >> Any hints as to what might be going on here would be greatly
> >> appreciated.  It takes me about 10 minutes to reproduce this, so I'm
> >> willing to try things.  I don't know enough about the internals of
> >> Solr's memory usage to proceed much further on my own.
> >>
> >> Thank you.
> >>
> >> -Greg
> >>
>
