RE: out of memory during indexing due to large incoming queue

2013-06-17 Thread Yoni Amir
Thanks Shawn,
This was very helpful. Indeed I had some terminology problems regarding segment 
merging. In any case, I tweaked the parameters you recommended and it helped a lot.

I was wondering about your recommendation to use facet.method=enum. Can you 
explain what the trade-off is here? I understand that I gain a benefit by using 
less memory, but what do I lose? Is it speed?

Also, do you know if there is an answer to my original question in this thread? 
Solr has a queue of incoming requests which, in my case, kept on growing. I 
looked at the code but couldn't find it; I think it may be an implicit queue 
in the form of Java's concurrent thread pool or something like that.

Is it possible to limit the size of this queue, or to determine its size during 
runtime? This is the last issue that I am trying to figure out right now.

Also, to answer your question about the field all_text: all the fields are 
stored in order to support partial-update of documents. Most of the fields are 
used for highlighting; all_text is used for searching. I'll gladly omit 
all_text from being stored, but then partial-update won't work.
The reason I didn't use edismax to search all the fields is that the list 
of all fields is very long. Can edismax handle several hundred fields in the 
list? What about dynamic fields? Edismax requires the list to be fixed in the 
configuration file, so I can't include dynamic fields there. I could pass along 
the full list in the 'qf' parameter in every search request, but this seems 
like a waste. Also, what about performance? I was told that the best practice 
in this case (you have lots of fields and want to search everything) is to copy 
everything to a catch-all field.

Thanks again,
Yoni

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Monday, June 03, 2013 17:08
To: solr-user@lucene.apache.org
Subject: Re: out of memory during indexing due to large incoming queue

On 6/3/2013 1:06 AM, Yoni Amir wrote:
 Solrconfig.xml - http://apaste.info/dsbv
 
 Schema.xml - http://apaste.info/67PI
 
 This solrconfig.xml file has optimization enabled. I had another file which I 
 can't locate at the moment, in which I defined a custom merge scheduler in 
 order to disable optimization.
 
 When I say 1000 segments, I mean that's the number I saw in the Solr UI. I assume 
 there were many more files than that.

I think we have a terminology problem happening here.  There's nothing you can 
put in a solrconfig.xml file to enable optimization.  Solr will only optimize 
when you explicitly send an optimize command to it.  There is segment merging, 
but that's not the same thing.  Segment merging is completely normal.  Normally 
it's in the background and indexing will continue while it's occurring, but if 
you get too many merges happening at once, that can stop indexing.  I have a 
solution for that:

At the following URL is my indexConfig section, geared towards heavy indexing.  
The TieredMergePolicy settings are the equivalent of a legacy mergeFactor of 
35.  I've gone with a lower-than-default ramBufferSizeMB here, to reduce memory 
usage.  The default value for this setting as of version 4.1 is 100:

http://apaste.info/4gaD
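
(The apaste.info link above has since expired; the block below is only a 
sketch of such an indexConfig, not the original paste. The mergeFactor-35 
equivalent comes from the description above; the other values are assumptions.)

	<indexConfig>
	  <!-- TieredMergePolicy settings roughly equivalent to a legacy mergeFactor of 35 -->
	  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
	    <int name="maxMergeAtOnce">35</int>
	    <int name="segmentsPerTier">35</int>
	    <int name="maxMergeAtOnceExplicit">105</int>
	  </mergePolicy>
	  <!-- allow more concurrent merge levels so indexing doesn't stall;
	       keep maxThreadCount at 1 unless the index is on SSD -->
	  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
	    <int name="maxMergeCount">6</int>
	    <int name="maxThreadCount">1</int>
	  </mergeScheduler>
	  <!-- lower than the 4.1+ default of 100, to reduce memory usage -->
	  <ramBufferSizeMB>48</ramBufferSizeMB>
	</indexConfig>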

One thing that this configuration does which might directly impact your 
setup is increase the maxMergeCount.  I believe the default value for this is 
3.  This means that if you get more than three levels of merging happening at 
the same time, indexing will stop until the number of levels drops.  
Because Solr always does the biggest merge first, this can really take a long 
time.  The combination of a large mergeFactor and a larger-than-normal 
maxMergeCount will ensure that this situation never happens.

If you are not using SSD, don't increase maxThreadCount beyond one.  The 
random-access characteristics of regular hard disks will make things go slower 
with more threads, not faster.  With SSD, increasing the threads can make 
things go faster.

There are a few high-memory-use things going on in your config/schema.

The first thing that jumped out at me is facets.  They use a lot of memory.  
You can greatly reduce the memory use by adding facet.method=enum to the 
query.  The default for the method is fc, which means fieldcache.  The size of 
the Lucene fieldcache cannot be directly controlled by Solr, unlike Solr's own 
caches.  It gets as big as it needs to be, and facets using the fc method will 
put all the facet data for the entire index in the fieldcache.
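
(Where that parameter goes: facet.method=enum can be added to each faceting 
request as a query parameter, or set once as a default on the request handler. 
A sketch with a hypothetical handler and facet field:)

	<requestHandler name="/select" class="solr.SearchHandler">
	  <lst name="defaults">
	    <str name="facet">true</str>
	    <str name="facet.method">enum</str>
	    <!-- hypothetical facet field -->
	    <str name="facet.field">category</str>
	  </lst>
	</requestHandler>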

The second thing that jumped out at me is the fact that all_text is being 
stored.  Apparently this is for highlighting.  I will admit that I do not know 
anything about highlighting, so you might need separate help there.  You are 
using edismax for your query parser, which is perfectly capable of searching 
all the fields that make up all_text, so in my mind, all_text doesn't need to 
exist at all.

If you wrote a custom merge scheduler that disables

Re: out of memory during indexing due to large incoming queue

2013-06-17 Thread Shawn Heisey

On 6/17/2013 4:32 AM, Yoni Amir wrote:

I was wondering about your recommendation to use facet.method=enum. Can you 
explain what the trade-off is here? I understand that I gain a benefit by using 
less memory, but what do I lose? Is it speed?


The problem with facet.method=fc (the default) and memory is that every 
field and query that you use for faceting ends up separately cached in 
the FieldCache, and the memory required grows as your index grows.  If 
you only use facets on one or two fields, then the normal method is 
fine, and subsequent facets will be faster.  It does eat a lot of java 
heap memory, though ... and the bigger your java heap is, the more 
problems you'll have with garbage collection.


With enum, it must gather the data out of the index for every facet run. 
 If you have plenty of extra memory for the OS disk cache, this is not 
normally a major issue, because it will be pulled out of RAM, similar to 
what happens with fc, except that it's not java heap memory.  The OS is 
a lot more efficient with how it uses memory than Java is.



Also, do you know if there is an answer to my original question in this thread? 
Solr has a queue of incoming requests which, in my case, kept on growing. I 
looked at the code but couldn't find it; I think it may be an implicit queue 
in the form of Java's concurrent thread pool or something like that.

Is it possible to limit the size of this queue, or to determine its size during 
runtime? This is the last issue that I am trying to figure out right now.


I do not know the answer to this.


Also, to answer your question about the field all_text: all the fields are 
stored in order to support partial-update of documents. Most of the fields are 
used for highlighting; all_text is used for searching. I'll gladly omit 
all_text from being stored, but then partial-update won't work.


Your copyFields will still work just fine with atomic updates even if 
they are not stored.  Behind the scenes, an atomic update is a delete 
and an add with the stored data plus the changes... if all your source 
fields are stored, then the copyField should be generated correctly from 
all the source fields.


The wiki page on the subject actually says that copyField destinations 
*MUST* be set to stored=false.


http://wiki.apache.org/solr/Atomic_Updates#Caveats_and_Limitations
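
(A minimal schema.xml sketch of that arrangement, with hypothetical field 
names: the source fields stay stored so atomic updates work, while the 
catch-all destination is indexed but not stored, per the caveat above.)

	<field name="title"    type="text_general" indexed="true" stored="true"/>
	<field name="body"     type="text_general" indexed="true" stored="true"/>
	<!-- copyField destination: searchable, but stored="false" -->
	<field name="all_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
	<copyField source="title" dest="all_text"/>
	<copyField source="body"  dest="all_text"/>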


The reason I didn't use edismax to search all the fields is that the list 
of all fields is very long. Can edismax handle several hundred fields in the 
list? What about dynamic fields? Edismax requires the list to be fixed in the 
configuration file, so I can't include dynamic fields there. I could pass along 
the full list in the 'qf' parameter in every search request, but this seems 
like a waste. Also, what about performance? I was told that the best practice 
in this case (you have lots of fields and want to search everything) is to copy 
everything to a catch-all field.


If there is ever any situation where you can come up with some searches 
that only need to search against some of the fields and other searches 
that need to search against different fields, then you might consider 
creating different search handlers with different qf lists.  If you 
always want to search against all the fields, then it's probably more 
efficient to keep your current method.
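
(A sketch of that idea with hypothetical handler names and fields -- one 
handler per use case, each with its own edismax qf list, so the field list 
doesn't have to be passed on every request:)

	<requestHandler name="/search-titles" class="solr.SearchHandler">
	  <lst name="defaults">
	    <str name="defType">edismax</str>
	    <str name="qf">title subtitle author</str>
	  </lst>
	</requestHandler>
	<requestHandler name="/search-all" class="solr.SearchHandler">
	  <lst name="defaults">
	    <str name="defType">edismax</str>
	    <str name="qf">all_text</str>
	  </lst>
	</requestHandler>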


Thanks,
Shawn



RE: out of memory during indexing due to large incoming queue

2013-06-03 Thread Yoni Amir
Solrconfig.xml - http://apaste.info/dsbv

Schema.xml - http://apaste.info/67PI

This solrconfig.xml file has optimization enabled. I had another file which I 
can't locate at the moment, in which I defined a custom merge scheduler in 
order to disable optimization.

When I say 1000 segments, I mean that's the number I saw in the Solr UI. I assume 
there were many more files than that.

Thanks,
Yoni



-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Sunday, June 02, 2013 22:53
To: solr-user@lucene.apache.org
Subject: Re: out of memory during indexing due to large incoming queue

On 6/2/2013 12:25 PM, Yoni Amir wrote:
 Hi Shawn and Shreejay, thanks for the response.
 Here is some more information:
 1) The machine is a virtual machine on ESX server. It has 4 CPUs and 
 8GB of RAM. I don't remember what CPU but something modern enough. It 
 is running Java 7 without any special parameters, and 4GB allocated 
 for Java (-Xmx)
 2) After successful indexing, I have 2.5 Million documents, 117GB index size. 
 This is the size after it was optimized.
 3) I plan to upgrade to 4.3 just didn't have time. 4.0 beta is what was 
 available at the time that we had a release deadline.
 4) The setup is master-slave replication, not Solr Cloud. The server that I 
 am discussing is the indexing server, and in these tests there were actually 
 no slaves involved, and virtually zero searches performed.
 5) Attached is my configuration. I tried to disable the warm-up and opening 
 of searchers, it didn't change anything. The commits are done by Solr, using 
 autocommit. The client sends the updates without a commit command.
 6) I want to disable optimization, but when I disabled it, the OOME occurred 
 even faster. The number of segments reached around a thousand within an hour 
 or so. I don't know if it's normal or not, but at that point if I restarted 
 Solr it immediately took about 1GB of heap space just on start-up, instead of 
 the usual 50MB or so.
 
 If I commit less frequently, don't I increase the risk of losing data, e.g., 
 if the power goes down, etc.?
 If I disable optimization, is it necessary to avoid such a large number of 
 segments? Is it possible?

Last part first: Losing data is much less of a risk with Solr 4.x, if you have 
enabled the updateLog.

We'll need some more info.  See the end of the message for specifics.

Right off the bat, I can tell you that with an index that's 117GB, you're going 
to need a LOT of RAM.

Each of my 4.2.1 servers has 42GB of index and about 37 million documents 
between all the index shards.  The web application never uses facets, which 
tend to use a lot of memory.  My index is a lot smaller than yours, and I need 
a 6GB heap, seeing OOM errors if it's only 4GB.
You probably need at least an 8GB heap, and possibly larger.

Beyond the amount of memory that Solr itself uses, for good performance you 
will also need a large amount of memory for OS disk caching.  Unless the server 
is using SSD, you need to allocate at least 64GB of real memory to the virtual 
machine.  If you've got your index on SSD, 32GB might be enough.  I've got 64GB 
total on my servers.

http://wiki.apache.org/solr/SolrPerformanceProblems

When you say that there are over 1000 segments, are you seeing 1000 files, or 
are there literally 1000 segments, giving you between 12000 and 15000 files?  
Even if your mergeFactor were higher than the default 10, that just shouldn't 
happen.

Can you share your solrconfig.xml and schema.xml?  Use a paste website like 
http://apaste.info and share the URLs.

Thanks,
Shawn





Re: out of memory during indexing due to large incoming queue

2013-06-02 Thread Shreejay
A couple of things: 

1) Can you give some more details about your setup? Like whether it's cloud or a 
single instance, how many nodes if it's cloud, the hardware (memory per 
machine), JVM options, etc.

2) Any specific reason for using 4.0 beta? The latest version is 4.3. I used 
4.0 for a few weeks and there were a lot of bugs related to memory and 
communication between nodes (ZooKeeper).
3) If you haven't seen it already, please go through this wiki page. It's an 
excellent starting point for troubleshooting memory and indexing issues, 
especially sections 3 to 7:
http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations


-- 
Shreejay


On Sunday, June 2, 2013 at 7:16, Yoni Amir wrote:

 Hello,
 I am receiving OutOfMemoryError during indexing, and after investigating the 
 heap dump, I am still missing some information, and I thought this might be a 
 good place for help.
 
 I am using Solr 4.0 beta, and I have 5 threads that send update requests to 
 Solr. Each request is a bulk of 100 SolrInputDocuments (using solrj), and my 
 goal is to index around 2.5 million documents.
 Solr is configured to do a hard-commit every 10 seconds, so initially I 
 thought that it can only accumulate in memory 10 seconds worth of updates, 
 but that's not the case. I can see in a profiler how it accumulates memory 
 over time, even with 4 to 6 GB of memory. It is also configured to optimize 
 with mergeFactor=10.
 
 At first I thought that optimization is a blocking, synchronous operation. It 
 is, in the sense that the index can't be updated during optimization. 
 However, it is not synchronous, in the sense that the update request coming 
 from my code is not blocked - Solr just returns an OK response, even while 
 the index is optimizing.
 This indicates that Solr has an internal queue of inbound requests, and that 
 the OK response just means that it is in the queue. I get confirmation for 
 this from a friend who is a Solr expert (or so I hope).
 
 My main question is: how can I put a bound on this internal queue, and make 
 update requests synchronous in case the queue is full? Put it another way, I 
 need to know if Solr is really ready to receive more requests, so I don't 
 overload it and cause OOME.
 
 I performed several tests, with slow and fast disks, and on the really fast 
 disk the problem didn't occur. However, I can't demand such a fast disk from 
 all the clients, and also even with a fast disk the problem will occur 
 eventually when I try to index 10 million documents.
 I also tried to perform indexing with optimization disabled, but it didn't 
 help.
 
 Thanks,
 Yoni
 
 
 




Re: out of memory during indexing due to large incoming queue

2013-06-02 Thread Shawn Heisey
On 6/2/2013 8:16 AM, Yoni Amir wrote:
 Hello,
 I am receiving OutOfMemoryError during indexing, and after investigating the 
 heap dump, I am still missing some information, and I thought this might be a 
 good place for help.
 
 I am using Solr 4.0 beta, and I have 5 threads that send update requests to 
 Solr. Each request is a bulk of 100 SolrInputDocuments (using solrj), and my 
 goal is to index around 2.5 million documents.
 Solr is configured to do a hard-commit every 10 seconds, so initially I 
 thought that it can only accumulate in memory 10 seconds worth of updates, 
 but that's not the case. I can see in a profiler how it accumulates memory 
 over time, even with 4 to 6 GB of memory. It is also configured to optimize 
 with mergeFactor=10.

4.0-BETA came out several months ago.  Even at the time, support for the
alpha and beta releases was limited.  Now it has been superseded by
4.0.0, 4.1.0, 4.2.0, 4.2.1, and 4.3.0, all of which are full releases.
There is a 4.3.1 release currently in the works.  Please upgrade.

Ten seconds is a very short interval for hard commits, even if you have
openSearcher=false.  Frequent hard commits can cause a whole host of
problems.  It's better to have an interval of several minutes, and I
wouldn't go less than a minute.  Soft commits can be much more frequent,
but if you are frequently opening new searchers, you'll probably want to
disable cache warming.
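
(A sketch of what that looks like in the updateHandler section of 
solrconfig.xml -- the intervals here are assumptions chosen to match the 
advice above, not required values:)

	<updateHandler class="solr.DirectUpdateHandler2">
	  <autoCommit>
	    <maxTime>300000</maxTime>          <!-- hard commit every 5 minutes -->
	    <openSearcher>false</openSearcher>
	  </autoCommit>
	  <autoSoftCommit>
	    <maxTime>15000</maxTime>           <!-- soft commit every 15 seconds -->
	  </autoSoftCommit>
	</updateHandler>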

On optimization: don't do it unless you absolutely must.  Most of the
time, optimization is only needed if you delete a lot of documents and
you need to get them removed from your index.  If you must optimize to
get rid of deleted documents, do it on a very long interval (once a day,
once a week) and pause indexing during optimization.
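
(For reference, an explicit optimize is not something solrconfig.xml turns on; 
it is a command sent to the /update handler, for example an XML message like 
the following sketch:)

	<optimize maxSegments="1" waitSearcher="true"/>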

You haven't said anything about your index size, java heap size, total
RAM, etc.  With those numbers I could offer some guesses about what you
need, but I'll warn you that they would only be guesses - watching a
system with real data under load is the only way to get concrete
information.  Here are some basic guidelines on performance problems and
RAM information:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



RE: out of memory during indexing due to large incoming queue

2013-06-02 Thread Yoni Amir
Hi Shawn and Shreejay, thanks for the response.
Here is some more information:
1) The machine is a virtual machine on ESX server. It has 4 CPUs and 8GB of 
RAM. I don't remember what CPU but something modern enough. It is running Java 
7 without any special parameters, and 4GB allocated for Java (-Xmx)
2) After successful indexing, I have 2.5 Million documents, 117GB index size. 
This is the size after it was optimized.
3) I plan to upgrade to 4.3 just didn't have time. 4.0 beta is what was 
available at the time that we had a release deadline.
4) The setup is master-slave replication, not Solr Cloud. The server that I 
am discussing is the indexing server, and in these tests there were actually no 
slaves involved, and virtually zero searches performed.
5) Attached is my configuration. I tried to disable the warm-up and opening of 
searchers, it didn't change anything. The commits are done by Solr, using 
autocommit. The client sends the updates without a commit command.
6) I want to disable optimization, but when I disabled it, the OOME occurred 
even faster. The number of segments reached around a thousand within an hour or 
so. I don't know if it's normal or not, but at that point if I restarted Solr 
it immediately took about 1GB of heap space just on start-up, instead of the 
usual 50MB or so.

If I commit less frequently, don't I increase the risk of losing data, e.g., if 
the power goes down, etc.?
If I disable optimization, is it necessary to avoid such a large number of 
segments? Is it possible?

Thanks again,
Yoni



-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Sunday, June 02, 2013 18:05
To: solr-user@lucene.apache.org
Subject: Re: out of memory during indexing due to large incoming queue

On 6/2/2013 8:16 AM, Yoni Amir wrote:
 Hello,
 I am receiving OutOfMemoryError during indexing, and after investigating the 
 heap dump, I am still missing some information, and I thought this might be a 
 good place for help.
 
 I am using Solr 4.0 beta, and I have 5 threads that send update requests to 
 Solr. Each request is a bulk of 100 SolrInputDocuments (using solrj), and my 
 goal is to index around 2.5 million documents.
 Solr is configured to do a hard-commit every 10 seconds, so initially I 
 thought that it can only accumulate in memory 10 seconds worth of updates, 
 but that's not the case. I can see in a profiler how it accumulates memory 
 over time, even with 4 to 6 GB of memory. It is also configured to optimize 
 with mergeFactor=10.

4.0-BETA came out several months ago.  Even at the time, support for the alpha 
and beta releases was limited.  Now it has been superseded by 4.0.0, 4.1.0, 
4.2.0, 4.2.1, and 4.3.0, all of which are full releases.
There is a 4.3.1 release currently in the works.  Please upgrade.

Ten seconds is a very short interval for hard commits, even if you have 
openSearcher=false.  Frequent hard commits can cause a whole host of problems.  
It's better to have an interval of several minutes, and I wouldn't go less than 
a minute.  Soft commits can be much more frequent, but if you are frequently 
opening new searchers, you'll probably want to disable cache warming.

On optimization: don't do it unless you absolutely must.  Most of the time, 
optimization is only needed if you delete a lot of documents and you need to 
get them removed from your index.  If you must optimize to get rid of deleted 
documents, do it on a very long interval (once a day, once a week) and pause 
indexing during optimization.

You haven't said anything about your index size, java heap size, total RAM, 
etc.  With those numbers I could offer some guesses about what you need, but 
I'll warn you that they would only be guesses - watching a system with real 
data under load is the only way to get concrete information.  Here are some 
basic guidelines on performance problems and RAM information:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn


<?xml version="1.0" encoding="UTF-8" ?>
<config>
	<luceneMatchVersion>LUCENE_40</luceneMatchVersion>

	<dataDir>${solr.data.dir:}</dataDir>

	<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}" />

	<indexConfig>

		<!-- ramBufferSizeMB sets the amount

Re: out of memory during indexing due to large incoming queue

2013-06-02 Thread Shawn Heisey
On 6/2/2013 12:25 PM, Yoni Amir wrote:
 Hi Shawn and Shreejay, thanks for the response.
 Here is some more information:
 1) The machine is a virtual machine on ESX server. It has 4 CPUs and 8GB of 
 RAM. I don't remember what CPU but something modern enough. It is running 
 Java 7 without any special parameters, and 4GB allocated for Java (-Xmx)
 2) After successful indexing, I have 2.5 Million documents, 117GB index size. 
 This is the size after it was optimized.
 3) I plan to upgrade to 4.3 just didn't have time. 4.0 beta is what was 
 available at the time that we had a release deadline.
 4) The setup is master-slave replication, not Solr Cloud. The server that I 
 am discussing is the indexing server, and in these tests there were actually 
 no slaves involved, and virtually zero searches performed.
 5) Attached is my configuration. I tried to disable the warm-up and opening 
 of searchers, it didn't change anything. The commits are done by Solr, using 
 autocommit. The client sends the updates without a commit command.
 6) I want to disable optimization, but when I disabled it, the OOME occurred 
 even faster. The number of segments reached around a thousand within an hour 
 or so. I don't know if it's normal or not, but at that point if I restarted 
 Solr it immediately took about 1GB of heap space just on start-up, instead of 
 the usual 50MB or so.
 
 If I commit less frequently, don't I increase the risk of losing data, e.g., 
 if the power goes down, etc.?
 If I disable optimization, is it necessary to avoid such a large number of 
 segments? Is it possible?

Last part first: Losing data is much less of a risk with Solr 4.x, if
you have enabled the updateLog.
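
(A sketch of enabling the transaction log in the updateHandler section of 
solrconfig.xml; the dir property shown is the usual default, not something 
taken from your configuration:)

	<updateHandler class="solr.DirectUpdateHandler2">
	  <updateLog>
	    <str name="dir">${solr.ulog.dir:}</str>
	  </updateLog>
	</updateHandler>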

We'll need some more info.  See the end of the message for specifics.

Right off the bat, I can tell you that with an index that's 117GB,
you're going to need a LOT of RAM.

Each of my 4.2.1 servers has 42GB of index and about 37 million
documents between all the index shards.  The web application never uses
facets, which tend to use a lot of memory.  My index is a lot smaller
than yours, and I need a 6GB heap, seeing OOM errors if it's only 4GB.
You probably need at least an 8GB heap, and possibly larger.

Beyond the amount of memory that Solr itself uses, for good performance
you will also need a large amount of memory for OS disk caching.  Unless
the server is using SSD, you need to allocate at least 64GB of real
memory to the virtual machine.  If you've got your index on SSD, 32GB
might be enough.  I've got 64GB total on my servers.

http://wiki.apache.org/solr/SolrPerformanceProblems

When you say that there are over 1000 segments, are you seeing 1000
files, or are there literally 1000 segments, giving you between 12000
and 15000 files?  Even if your mergeFactor were higher than the default
10, that just shouldn't happen.

Can you share your solrconfig.xml and schema.xml?  Use a paste website
like http://apaste.info and share the URLs.

Thanks,
Shawn