Thanks Shawn,
This was very helpful. Indeed, I had a terminology problem regarding segment 
merging. In any case, I tweaked the parameters that you recommended and it 
helped a lot.

I was wondering about your recommendation to use facet.method=enum. Can you 
explain what the trade-off is here? I understand that I gain a benefit by using 
less memory, but what will I lose? Is it speed?

Also, do you know if there is an answer to my original question in this thread? 
Solr has a queue of incoming requests, which, in my case, kept on growing. I 
looked at the code but couldn't find it; I think it may be an implicit queue 
in the form of Java's concurrent thread pool or something like that.

Is it possible to limit the size of this queue, or to determine its size during 
runtime? This is the last issue that I am trying to figure out right now.

Also, to answer your question about the field all_text: all the fields are 
stored in order to support partial-update of documents. Most of the fields are 
used for highlighting, all_text is used for searching. I'll gladly omit 
all_text from being stored, but then partial-update won't work.
The reason I didn't use edismax to search all the fields is that the list 
of all fields is very long. Can edismax handle several hundred fields in the 
list? What about dynamic fields? Edismax requires the list to be fixed in the 
configuration file, so I can't include dynamic fields there. I can pass along 
the full list in the 'qf' parameter in every search request, but this seems 
like a waste? Also, what about performance? I was told that the best practice 
in this case (you have lots of fields and want to search everything) is to copy 
everything to a catch-all field.
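
For illustration, passing the field list in 'qf' on every request might look 
roughly like the sketch below. It is only a sketch under assumptions: the field 
names (title, body, comment_txt) and the boost are hypothetical, and it assumes 
the client already knows the concrete names of any dynamic fields, since the 
list has to be spelled out in full.

```python
from urllib.parse import urlencode


def edismax_params(user_query, fields, boosts=None):
    """Build request parameters for an edismax search over many fields.

    fields: concrete field names (hypothetical here); dynamic fields must
            already be expanded to real names by the caller.
    boosts: optional mapping of field name -> boost factor.
    """
    boosts = boosts or {}
    # qf is a whitespace-separated list of fields, each optionally "name^boost".
    qf = " ".join(f"{f}^{boosts[f]}" if f in boosts else f for f in fields)
    return {"defType": "edismax", "q": user_query, "qf": qf}


params = edismax_params("memory leak", ["title", "body", "comment_txt"], {"title": 2})
# The encoded string is what would be appended to the select URL.
print(urlencode(params))
```

With several hundred fields the encoded 'qf' value becomes very long, which is 
the overhead I am asking about.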

Thanks again,
Yoni

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Monday, June 03, 2013 17:08
To: solr-user@lucene.apache.org
Subject: Re: out of memory during indexing do to large incoming queue

On 6/3/2013 1:06 AM, Yoni Amir wrote:
> Solrconfig.xml -> http://apaste.info/dsbv
> 
> Schema.xml -> http://apaste.info/67PI
> 
> This solrconfig.xml file has optimization enabled. I had another file which I 
> can't locate at the moment, in which I defined a custom merge scheduler in 
> order to disable optimization.
> 
> When I say 1000 segments, I mean that's the number I saw in Solr UI. I assume 
> there were many more files than that.

I think we have a terminology problem happening here.  There's nothing you can 
put in a solrconfig.xml file to enable optimization.  Solr will only optimize 
when you explicitly send an optimize command to it.  There is segment merging, 
but that's not the same thing.  Segment merging is completely normal.  Normally 
it's in the background and indexing will continue while it's occurring, but if 
you get too many merges happening at once, that can stop indexing.  I have a 
solution for that:

At the following URL is my indexConfig section, geared towards heavy indexing.  
The TieredMergePolicy settings are the equivalent of a legacy mergeFactor of 
35.  I've gone with a lower-than-default ramBufferSizeMB here, to reduce memory 
usage.  The default value for this setting as of version 4.1 is 100:

http://apaste.info/4gaD

One thing that this configuration does which might directly impact on your 
setup is increase the maxMergeCount.  I believe the default value for this is 
3.  This means that if you get more than three "levels" of merging happening at 
the same time, indexing will stop until the number of levels drops.  
Because Solr always does the biggest merge first, this can really take a long 
time.  The combination of a large mergeFactor and a larger-than-normal 
maxMergeCount will ensure that this situation never happens.
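
As a rough illustration of the kind of indexConfig section described above, 
the sketch below renders one from a few values.  It is not the content of the 
linked paste.  The value 35 and the single merge thread come from this message; 
the maxMergeCount of 6 and the ramBufferSizeMB of 48 are placeholder 
assumptions, chosen only to be "higher than 3" and "lower than 100".  The 
element layout assumes Solr 4.x solrconfig.xml conventions.

```python
import xml.etree.ElementTree as ET


def build_index_config(merge_factor=35, max_merge_count=6,
                       max_thread_count=1, ram_buffer_mb=48):
    """Render an <indexConfig> fragment as a string (illustrative values)."""
    root = ET.Element("indexConfig")

    # TieredMergePolicy: both settings set to the legacy mergeFactor value.
    policy = ET.SubElement(
        root, "mergePolicy",
        {"class": "org.apache.lucene.index.TieredMergePolicy"})
    for name in ("maxMergeAtOnce", "segmentsPerTier"):
        ET.SubElement(policy, "int", {"name": name}).text = str(merge_factor)

    # Merge scheduler: one thread for spinning disks, larger merge count.
    scheduler = ET.SubElement(
        root, "mergeScheduler",
        {"class": "org.apache.lucene.index.ConcurrentMergeScheduler"})
    ET.SubElement(scheduler, "int",
                  {"name": "maxThreadCount"}).text = str(max_thread_count)
    ET.SubElement(scheduler, "int",
                  {"name": "maxMergeCount"}).text = str(max_merge_count)

    # Assumed placeholder, lower than the default of 100 mentioned above.
    ET.SubElement(root, "ramBufferSizeMB").text = str(ram_buffer_mb)
    return ET.tostring(root, encoding="unicode")


print(build_index_config())
```

Check the printed fragment against the documentation for your Solr version 
before copying anything into solrconfig.xml.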

If you are not using SSD, don't increase maxThreadCount beyond one.  The 
random-access characteristics of regular hard disks will make things go slower 
with more threads, not faster.  With SSD, increasing the threads can make 
things go faster.

There are a few high-memory-use things going on in your config/schema.

The first thing that jumped out at me is facets.  They use a lot of memory.  
You can greatly reduce the memory use by adding &facet.method=enum to the 
query.  The default for the method is fc, which means fieldcache.  The size of 
the Lucene fieldcache cannot be directly controlled by Solr, unlike Solr's own 
caches.  It gets as big as it needs to be, and facets using the fc method will 
put all the facet data for the entire index in the fieldcache.
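
A minimal sketch of adding that parameter to a request is below.  The base URL, 
core name (collection1) and facet field (category) are placeholders I made up 
for illustration, not something taken from your config; only facet.method=enum 
is the point here.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, method="enum"):
    """Build a select URL that facets on the given fields.

    method: "enum" or "fc"; "fc" is the default discussed above.
    """
    params = [
        ("q", query),
        ("rows", "0"),            # only facet counts are wanted here
        ("facet", "true"),
        ("facet.method", method),
    ]
    # facet.field may be repeated, once per field.
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "/select?" + urlencode(params)


# Hypothetical host, core and field names.
url = facet_query_url("http://localhost:8983/solr/collection1", "*:*", ["category"])
print(url)
```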

The second thing that jumped out at me is the fact that all_text is being 
stored.  Apparently this is for highlighting.  I will admit that I do not know 
anything about highlighting, so you might need separate help there.  You are 
using edismax for your query parser, which is perfectly capable of searching 
all the fields that make up all_text, so in my mind, all_text doesn't need to 
exist at all.

If you wrote a custom merge scheduler that disables merging, that's probably 
why you're getting over 1000 segments.  Having a really huge number of segments 
can also cause memory issues, because each one needs its own memory structures.  
If you've got a few dozen of them, that's no big deal, but 1000+ is.

One side issue: I noticed that you edited and duplicated the /browse handler.  
The velocity templates are not meant for production use.  They are made to 
illustrate Solr's capability without a lot of coding.

In order for your audience to use the /browse handler, you have to open your 
Solr instance up to the entire audience, which might be the entire Internet.  
Anyone who can reach the Solr interface is capable of erasing or changing your 
index.  Even if you take steps to prevent that, they can send 
denial of service query attacks.

Thanks,
Shawn

