Hi Greg,

I have a suspicion that the problem might be related or exacerbated be
overly large tlogs. Can you adjust your autoCommits to 15 seconds. Leave
openSearcher = false. I would remove the maxDoc as well. If you try
rerunning under those commit setting it's possible the OOM errors will stop
occurring.

Joel

Joel Bernstein
Search Engineer at Heliosearch


On Mon, Dec 23, 2013 at 3:07 PM, Greg Preston <gpres...@marinsoftware.com>wrote:

> Hello,
>
> I'm loading up our solr cloud with data (from a solrj client) and
> running into a weird memory issue.  I can reliably reproduce the
> problem.
>
> - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
> - 24 solr nodes (one shard each), spread across 3 physical hosts, each
> host has 256G of memory
> - index and tlogs on ssd
> - Xmx=7G, G1GC
> - Java 1.7.0_25
> - schema and solrconfig.xml attached
>
> I'm using composite routing to route documents with the same clientId
> to the same shard.  After several hours of indexing, I occasionally
> see an IndexWriter go OOM.  I think that's a symptom.  When that
> happens, indexing continues, and that node's tlog starts to grow.
> When I notice this, I stop indexing, and bounce the problem node.
> That's where it gets interesting.
>
> Upon bouncing, the tlog replays, and then segments merge.  Once the
> merging is complete, the heap is fairly full, and forced full GC only
> helps a little.  But if I then bounce the node again, the heap usage
> goes way down, and stays low until the next segment merge.  I believe
> segment merges are also what causes the original OOM.
>
> More details:
>
> Index on disk for this node is ~13G, tlog is ~2.5G.
> See attached mem1.png.  This is a jconsole view of the heap during the
> following:
>
> (Solr cloud node started at the left edge of this graph)
>
> A) One CPU core pegged at 100%.  Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
>         at
> org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
>         at
> org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
>         at
> org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
>         at
> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
>         at
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
>         at
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> B) One CPU core pegged at 100%.  Manually triggered GC.  Lots of
> memory freed.  Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at
> org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
>         at
> org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
>         at
> org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>         at
> org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>         at
> org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>         at
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>         at
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> C) One CPU core pegged at 100%.  Manually triggered GC.  No memory
> freed.  Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at
> org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
>         at
> org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
>         at
> org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>         at
> org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>         at
> org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>         at
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>         at
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> D) One CPU core pegged at 100%.  Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800
> nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at
> org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:322)
>         at
> org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:169)
>         at
> org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.merge(CompressingTermVectorsWriter.java:789)
>         at
> org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:312)
>         at
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:130)
>         at
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> E) CPU usage drops to nominal levels.  Thread dump shows no Lucene Merge
> Thread.
>
> F) Manually triggered full GC.  Some memory freed, but much remains.
>
> G) Restarted solr.  Very little memory used.
>
>
> Throughout all of this, there was no indexing or querying happening.
> In ordered to try to determine what's using up the memory, I took a
> heap dump at point (F) and analyzed it in Eclipse MAT (see attached
> screenshot).  This shows 311 instances of Lucene42DocValuesProducer$3,
> each holding a large byte[].  By attaching a remote debugger and
> re-running, it looks like there is one of these byte[] for each field
> in the schema (we have several of the "dim_*" dynamic fields).  And as
> far as I know, I'm not using DocValues at all.
>
>
> Any hints as to what might be going on here would be greatly
> appreciated.  It takes me about 10 minutes to reproduce this, so I'm
> willing to try things.  I don't know enough about the internals of
> solr's memory usage to proceed much further on my own.
>
> Thank you.
>
> -Greg
>

Reply via email to