Re: SOLR 4.1 Out Of Memory error After commit of a few thousand Solr Docs

2013-01-27 Thread Shawn Heisey

On 1/27/2013 10:28 PM, Rahul Bishnoi wrote:

Thanks for your reply. After following your suggestions, we were able to
index 30k documents. I have some questions:

1) What is stored in RAM while only indexing is going on?  How do we
calculate the RAM/heap requirements for our documents?
2) The document cache, filter cache, etc. are populated while querying.
Correct me if I am wrong. Are there any caches that are populated while
indexing?


If anyone catches me making statements that are not true, please feel 
free to correct me.


The caches are indeed only used during querying.  If you are not making 
queries at all, they aren't much of a factor.


I can't give you any definitive answers to your question about RAM usage 
and how to calculate RAM/heap requirements.  I can make some general 
statements without looking at the code, just based on what I've learned 
so far about Solr, and about Java in general.


You would have an exact copy of the input text for each field initially, 
which would ultimately get used for the stored data (for those fields 
that are stored).  Each one is probably just a plain String, though I 
don't know as I haven't read the code.  If the field is not being stored 
or copied, then it would be possible to get rid of that data as soon as 
it is no longer required for indexing.  I don't have any idea whether 
Solr/Lucene code actually gets rid of the exact copy in this way.


If you are storing termvectors, additional memory would be needed for 
that.  I don't know if that involves lots of objects or if it's one 
object with index information.  Based on my experience, termvectors can 
be bigger than the stored data for the same field.
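
For anyone following along, term vectors are turned on per field in
schema.xml.  The field below is purely illustrative - the name and type
are made up, not taken from anyone's actual schema:

<field name="body" type="text_general" indexed="true" stored="true"
       termVectors="true"/>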


Tokenization and filtering are where I imagine that most of the memory
would get used.  If you're using a filter like EdgeNGram, that's a LOT 
of tokens.  Even if you're just tokenizing words, it can add up.  There 
is also space required for the inverted index, norms, and other 
data/metadata.  If each token is a separate Java object (which I do not 
know), there would be a fair amount of memory overhead involved.
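
As a rough illustration (the filter parameters here are hypothetical, not
taken from anyone's schema), an EdgeNGram filter configured like this:

<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>

would turn the single token "memory" into m, me, mem, memo, memor, and
memory - six tokens where plain word tokenization produces one, and all
of them exist as objects while the field is being analyzed.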


A String object in Java has something like 40 bytes of overhead above
and beyond the space required for the data.  Also, strings in Java are 
internally represented in UTF-16, so each character actually takes two 
bytes.


http://www.javamex.com/tutorials/memory/string_memory_usage.shtml
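
As a back-of-the-envelope sketch only - the 40-byte figure is the rough
overhead described on that page, and the real numbers vary by JVM:

// Very rough sketch: fixed per-object overhead plus two bytes per character.
// The 40-byte constant is an approximation; real overhead depends on the JVM.
public class StringSizeEstimate {
    static long approxStringBytes(String s) {
        return 40L + 2L * (long) s.length();
    }
}

By that sort of estimate, a single 1000-word field of around 6000
characters is roughly 12KB as one String, before analysis creates any
additional token objects.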

The finished documents stack up in the ramBufferSizeMB space until it 
gets full or a hard commit is issued, at which point they are flushed to 
disk as a Lucene segment.  One thing that I'm not sure about is whether 
an additional ram buffer is allocated for further indexing while the 
flush is happening, or if it flushes and then re-uses the buffer for 
subsequent documents.
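
For reference, that buffer is configured in the indexConfig section of
solrconfig.xml.  The 32 below is only an example (the old default
mentioned elsewhere in this thread), not a recommendation:

<indexConfig>
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexConfig>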


Another way that it can use memory is when merging index segments.  I 
don't know how much memory gets used for this process.


On Solr 4 with the default directory factory, part of a flushed segment 
may remain in RAM until enough additional segment data is created.  The 
amount of memory used by this feature should be pretty small, unless you 
have a lot of cores on a single JVM.  That extra memory can be 
eliminated by using MMapDirectoryFactory instead of 
NRTCachingDirectoryFactory, at the expense of fast Near-RealTime index 
updates.
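
For reference, the switch is made with the directoryFactory element in
solrconfig.xml.  The line below is the usual form, shown only as an
illustration:

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>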


Thanks,
Shawn



Re: SOLR 4.1 Out Of Memory error After commit of a few thousand Solr Docs

2013-01-27 Thread Rahul Bishnoi
Hi Shawn,

Thanks for your reply. After following your suggestions, we were able to
index 30k documents. I have some questions:

1) What is stored in RAM while only indexing is going on?  How do we
calculate the RAM/heap requirements for our documents?
2) The document cache, filter cache, etc. are populated while querying.
Correct me if I am wrong. Are there any caches that are populated while
indexing?

Thanks,
Rahul



On Sat, Jan 26, 2013 at 11:46 PM, Shawn Heisey wrote:

> On 1/26/2013 12:55 AM, Rahul Bishnoi wrote:
>
>> Thanks for quick reply and addressing each point queried.
>>
>> Additional asked information is mentioned below:
>>
>> OS = Ubuntu 12.04 (64 bit)
>> Sun Java 7 (64 bit)
>> Total RAM = 8GB
>>
>> SolrConfig.xml is available at http://pastebin.com/SEFxkw2R
>>
>>
> Rahul,
>
> The MaxPermSize setting could be a contributing factor.  The documents where
> you have 1000 words are somewhat large, though your overall index size is
> pretty small.  I would try removing the MaxPermSize option and see what
> happens.  You can also try reducing the ramBufferSizeMB in solrconfig.xml.
>  The default in previous versions of Solr was 32, which is big enough for
> most things, unless you are indexing HUGE documents like entire books.
>
> It looks like you have the cache sizes under <query> at values close to
> default.  I wouldn't decrease the documentCache any - in fact an increase
> might be a good thing there.  As for the others, you could probably reduce
> them.  The filterCache size I would start at 64 or 128.  Watch your cache
> hit ratios to see whether the changes make things remarkably worse.
>
> If that doesn't help, try increasing the -Xmx option - first 3072m, next
> 4096m.  You could go as high as 6GB and not run into any OS cache problems
> with your small index size, though you might run into long GC pauses.
>
> Indexing, especially big documents, is fairly memory intensive.  Some
> queries can be memory intensive as well, especially those using facets or a
> lot of clauses.
>
> Under normal operation, I could probably get away with a 3GB heap size,
> but I have it at 8GB because otherwise a full reindex (full-import from
> mysql) runs into OOM errors.
>
> Thanks,
> Shawn
>
>


Re: SOLR 4.1 Out Of Memory error After commit of a few thousand Solr Docs

2013-01-26 Thread Shawn Heisey

On 1/26/2013 12:55 AM, Rahul Bishnoi wrote:

Thanks for the quick reply and for addressing each point.

The additional information you asked for is below:

OS = Ubuntu 12.04 (64 bit)
Sun Java 7 (64 bit)
Total RAM = 8GB

SolrConfig.xml is available at http://pastebin.com/SEFxkw2R



Rahul,

The MaxPermSize setting could be a contributing factor.  The documents
where you have 1000 words are somewhat large, though your overall index
size is pretty small.  I would try removing the MaxPermSize option and
see what happens.  You can also try reducing the ramBufferSizeMB in
solrconfig.xml.  The default in previous versions of Solr was 32, which 
is big enough for most things, unless you are indexing HUGE documents 
like entire books.


It looks like you have the cache sizes under <query> at values close to
default.  I wouldn't decrease the documentCache any - in fact an
increase might be a good thing there.  As for the others, you could
probably reduce them.  The filterCache size I would start at 64 or 128.
Watch your cache hit ratios to see whether the changes make things
remarkably worse.
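
To make that concrete, here is roughly what I mean inside the <query>
section of solrconfig.xml.  The sizes are starting points to experiment
with, not tuned values:

<query>
  <filterCache class="solr.FastLRUCache" size="128" initialSize="128"
               autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="1024" initialSize="1024"
                 autowarmCount="0"/>
</query>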


If that doesn't help, try increasing the -Xmx option - first 3072m, next 
4096m.  You could go as high as 6GB and not run into any OS cache 
problems with your small index size, though you might run into long GC 
pauses.
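
Putting those suggestions together, your JAVA_OPTS might look something
like the line below - this is just your original options with MaxPermSize
removed and -Xmx raised, not a tuned recommendation:

JAVA_OPTS="-Xms256m -Xmx3072m -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
-XX:+ParallelRefProcEnabled -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/ubuntu/OOM_HeapDump"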


Indexing, especially big documents, is fairly memory intensive.  Some 
queries can be memory intensive as well, especially those using facets 
or a lot of clauses.


Under normal operation, I could probably get away with a 3GB heap size, 
but I have it at 8GB because otherwise a full reindex (full-import from 
mysql) runs into OOM errors.


Thanks,
Shawn



Re: SOLR 4.1 Out Of Memory error After commit of a few thousand Solr Docs

2013-01-25 Thread Rahul Bishnoi
Thanks for the quick reply and for addressing each point.

The additional information you asked for is below:

OS = Ubuntu 12.04 (64 bit)
Sun Java 7 (64 bit)
Total RAM = 8GB

SolrConfig.xml is available at http://pastebin.com/SEFxkw2R


Re: SOLR 4.1 Out Of Memory error After commit of a few thousand Solr Docs

2013-01-25 Thread Shawn Heisey

On 1/25/2013 4:49 AM, Harish Verma wrote:

We are testing Solr 4.1 running inside Tomcat 7 and Java 7 with the
following options:

JAVA_OPTS="-Xms256m -Xmx2048m -XX:MaxPermSize=1024m -XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode -XX:+ParallelRefProcEnabled
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/ubuntu/OOM_HeapDump"

Our source code looks like the following:
/* START */
int noOfSolrDocumentsInBatch = 0;
for (int i = 0; i < 5000; i++) {
    SolrInputDocument solrInputDocument = getNextSolrInputDocument();
    server.add(solrInputDocument);
    noOfSolrDocumentsInBatch += 1;
    if (noOfSolrDocumentsInBatch == 10) {
        server.commit();
        noOfSolrDocumentsInBatch = 0;
    }
}
/* END */

the method "getNextSolrInputDocument()" generates a solr document with 100
fields (average). Around 50 of the fields are of "text_general" type.
Some of the "test_general" fields consist of approx 1000 words rest
consists of few words. Ouf of total fields there are around 35-40
multivalued fields (not of type "text_general").
We are indexing all the fields but storing only 8 fields. Out of these 8
fields, two are string type, five are long, and one is boolean. So our
index size is only 394 MB, but the RAM occupied at the time of OOM is
around 2.5 GB. Why is the memory usage so high even though the index size
is small? What is being stored in memory? Our understanding is that after
every commit the documents are flushed to disk, so nothing should remain
in RAM after a commit.

We are using the following settings:

server.commit() is called with waitFlush=true and waitSearcher=true
solrconfig.xml has the following properties set:
directoryFactory = solr.MMapDirectoryFactory
maxWarmingSearchers = 1
The text_general field type is used as supplied in the schema.xml that
ships with the Solr setup.
maxIndexingThreads = 8 (default)
<autoCommit><maxTime>15000</maxTime><openSearcher>false</openSearcher></autoCommit>

We get a Java heap Out Of Memory error after committing around 3990 Solr
documents. Some snapshots of the memory dump from a profiler are uploaded
at the following links.
http://s9.postimage.org/w7589t9e7/memorydump1.png
http://s7.postimage.org/p3abs6nuj/memorydump2.png

Can somebody please suggest what we should do to minimize/optimize the
memory consumption in our case, with reasons?
Also, what would be optimal values (and why) for the following
solrconfig.xml parameters?
useColdSearcher - true/false?
maxWarmingSearchers - what number?
spellcheck - on/off?
omitNorms - true/false?
omitTermFreqAndPositions?
mergeFactor? (we are using the default value of 10)
Java garbage collection tuning parameters?


Additional information is needed.  What OS platform?  Is the OS 64-bit?  
Is Java 64-bit?  How much total RAM?  We'll need your solrconfig.xml 
file, in particular the query and indexConfig sections. Use your 
favorite paste site (pastie.org, pastebin.com for example) to link the 
solrconfig.xml file.


General thoughts without the above information:

You are reserving half of your Java heap for the permanent generation.
I have a solr installation where Java has a max heap of 8GB, about 5GB 
of that is currently committed - actually allocated at the OS level.  My 
perm gen space is 65908KB.  This server handles a total index size of 
nearly 70GB.  I doubt you need 1GB for your perm gen size.
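
If you want to keep an explicit cap rather than dropping the option
entirely, something much smaller would probably be plenty - the exact
number below is just a guess on my part:

-XX:MaxPermSize=128m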


A 2GB heap is fairly small in the Solr world.  If you are using a 32-bit
Java, that's the biggest heap you can create, so 64-bit on both Java and
OS is the way to go.  You can reduce memory requirements a small amount 
by using Jetty instead of Tomcat, but the difference is probably not big 
enough to really matter.


For the questions you asked at the end, most of them are personal 
preference, but maxWarmingSearchers should normally be kept low.  I 
think I have a value of 2 in my config.  Here are the GC tuning 
parameters that I am currently testing.  I have been having problems 
with long GC pauses that I am trying to fix:


-Xms1024M
-Xmx8192M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:NewRatio=3
-XX:MaxTenuringThreshold=8
-XX:+CMSParallelRemarkEnabled

You should only use CMSIncrementalMode if you only have one or two 
processor cores.  My reading has suggested that when you have more, it 
is not beneficial.


So far my GC parameters seem to be working really well, but I need to do 
a full reindex which should force usage of the entire 8GB heap and push 
garbage collection to its limits.


I have a question of my own for someone familiar with the code. Does 
Solr extensively use weak references?  If so, ParallelRefProcEnabled 
might be a win.


Thanks,
Shawn



SOLR 4.1 Out Of Memory error After commit of a few thousand Solr Docs

2013-01-25 Thread Harish Verma
We are testing Solr 4.1 running inside Tomcat 7 and Java 7 with the
following options:

JAVA_OPTS="-Xms256m -Xmx2048m -XX:MaxPermSize=1024m -XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode -XX:+ParallelRefProcEnabled
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/ubuntu/OOM_HeapDump"

Our source code looks like the following:
/* START */
int noOfSolrDocumentsInBatch = 0;
for (int i = 0; i < 5000; i++) {
    SolrInputDocument solrInputDocument = getNextSolrInputDocument();
    server.add(solrInputDocument);
    noOfSolrDocumentsInBatch += 1;
    if (noOfSolrDocumentsInBatch == 10) {
        server.commit();
        noOfSolrDocumentsInBatch = 0;
    }
}
/* END */

the method "getNextSolrInputDocument()" generates a solr document with 100
fields (average). Around 50 of the fields are of "text_general" type.
Some of the "test_general" fields consist of approx 1000 words rest
consists of few words. Ouf of total fields there are around 35-40
multivalued fields (not of type "text_general").
We are indexing all the fields but storing only 8 fields. Out of these 8
fields, two are string type, five are long, and one is boolean. So our
index size is only 394 MB, but the RAM occupied at the time of OOM is
around 2.5 GB. Why is the memory usage so high even though the index size
is small? What is being stored in memory? Our understanding is that after
every commit the documents are flushed to disk, so nothing should remain
in RAM after a commit.

We are using the following settings:

server.commit() is called with waitFlush=true and waitSearcher=true
solrconfig.xml has the following properties set:
directoryFactory = solr.MMapDirectoryFactory
maxWarmingSearchers = 1
The text_general field type is used as supplied in the schema.xml that
ships with the Solr setup.
maxIndexingThreads = 8 (default)
<autoCommit><maxTime>15000</maxTime><openSearcher>false</openSearcher></autoCommit>

We get a Java heap Out Of Memory error after committing around 3990 Solr
documents. Some snapshots of the memory dump from a profiler are uploaded
at the following links.
http://s9.postimage.org/w7589t9e7/memorydump1.png
http://s7.postimage.org/p3abs6nuj/memorydump2.png

Can somebody please suggest what we should do to minimize/optimize the
memory consumption in our case, with reasons?
Also, what would be optimal values (and why) for the following
solrconfig.xml parameters?
useColdSearcher - true/false?
maxWarmingSearchers - what number?
spellcheck - on/off?
omitNorms - true/false?
omitTermFreqAndPositions?
mergeFactor? (we are using the default value of 10)
Java garbage collection tuning parameters?


Regards
Harish Verma