On 6/3/2013 1:06 AM, Yoni Amir wrote:
> Solrconfig.xml -> http://apaste.info/dsbv
> 
> Schema.xml -> http://apaste.info/67PI
> 
> This solrconfig.xml file has optimization enabled. I had another file which I 
> can't locate at the moment, in which I defined a custom merge scheduler in 
> order to disable optimization.
> 
> When I say 1000 segments, I mean that's the number I saw in the Solr UI.
> I assume there were many more files than that.

I think we have a terminology problem happening here.  There's nothing
you can put in a solrconfig.xml file to enable optimization.  Solr will
only optimize when you explicitly send an optimize command to it.  There
is segment merging, but that's not the same thing.  Segment merging is
completely normal.  Normally it's in the background and indexing will
continue while it's occurring, but if you get too many merges happening
at once, that can stop indexing.  I have a solution for that:

At the following URL is my indexConfig section, geared towards heavy
indexing.  The TieredMergePolicy settings are the equivalent of a legacy
mergeFactor of 35.  I've gone with a lower-than-default ramBufferSizeMB
here to reduce memory usage; the default value for this setting as of
version 4.1 is 100:

http://apaste.info/4gaD
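
In case that paste expires, here is a sketch along the same lines.  The
35s come straight from the mergeFactor equivalence; the other numbers
(the 48MB buffer, maxMergeCount of 6) are illustrative guesses, so tune
them for your own hardware:

  <indexConfig>
    <!-- Equivalent of legacy mergeFactor=35: up to 35 segments merged
         at once, and 35 segments allowed per tier before merging. -->
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">35</int>
      <int name="segmentsPerTier">35</int>
    </mergePolicy>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <!-- One merge thread is right for spinning disks; see below. -->
      <int name="maxThreadCount">1</int>
      <!-- Allow more simultaneous merge "levels" than the default of 3,
           so indexing doesn't stall waiting on a big merge. -->
      <int name="maxMergeCount">6</int>
    </mergeScheduler>
    <!-- Lower than the 4.1 default of 100, to reduce memory usage. -->
    <ramBufferSizeMB>48</ramBufferSizeMB>
  </indexConfig>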

One thing this configuration does that might directly impact your setup
is increase maxMergeCount.  I believe the default value for this is 3,
which means that if more than three "levels" of merging are happening at
the same time, indexing will stop until the number of levels drops.
Because Solr always does the biggest merge first, that can take a long
time.  The combination of a large mergeFactor and a larger-than-normal
maxMergeCount ensures that this situation never happens.

If you are not using SSD, don't increase maxThreadCount beyond one.  The
random-access characteristics of regular hard disks make things slower
with more threads, not faster.  With SSD, increasing the thread count
can speed things up.

There are a few high-memory-use things going on in your config/schema.

The first thing that jumped out at me is facets.  They use a lot of
memory.  You can greatly reduce the memory use by adding
&facet.method=enum to the query.  The default for the method is fc,
which means fieldcache.  The size of the Lucene fieldcache cannot be
directly controlled by Solr, unlike Solr's own caches.  It gets as big
as it needs to be, and facets using the fc method will put all the facet
data for the entire index in the fieldcache.
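
For illustration (the hostname, core name, and field name here are
invented), a facet query using the enum method looks like this:

  http://localhost:8983/solr/yourcore/select?q=*:*&facet=true&facet.field=category&facet.method=enum

The enum method steps through the indexed terms and uses the
filterCache, so its memory use is governed by Solr's own cache settings
rather than the unbounded fieldcache.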

The second thing that jumped out at me is that all_text is being stored,
apparently for highlighting.  I will admit that I do not know anything
about highlighting, so you might need separate help there.  You are
using edismax for your query parser, which is perfectly capable of
searching all the fields that make up all_text, so in my mind, all_text
doesn't need to exist at all.
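
As a sketch (the field names are invented; substitute whatever actually
gets copied into all_text), pointing edismax at the source fields is
just a matter of listing them in qf:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <!-- hypothetical fields; use the ones behind all_text -->
      <str name="qf">title^2 description keywords</str>
    </lst>
  </requestHandler>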

If you wrote a custom merge scheduler that disables merging, that's
probably why you're getting over 1000 segments.  Having a really huge
number of segments can also cause memory issues, because each one needs
its own memory structures.  If you've got a few dozen of them, that's no
big deal, but 1000+ is.
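
Once normal merging is back in place, a one-time optimize will collapse
the segment count.  Adjusting the URL for your own host and core,
something like this would do it:

  curl "http://localhost:8983/solr/yourcore/update?optimize=true"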

One side issue: I noticed that you edited and duplicated the /browse
handler.  The velocity templates are not meant for production use.  They
are made to illustrate Solr's capabilities without a lot of coding.

For your audience to use the /browse handler, you have to open your Solr
instance up to that entire audience, which might be the whole Internet.
Anyone who can reach the Solr interface can erase or change your index.
Even if you take steps to prevent that, they can still mount
denial-of-service attacks with expensive queries.

Thanks,
Shawn
