Hi,

When indexing large amounts of data I hit a problem whereby Solr becomes unresponsive and doesn't recover (even when left overnight!). I think I've run into GC problems / the GC needs tuning, and I wanted to know if anyone else has hit this. I can replicate the error (albeit taking longer to do so) using only the stock Solr/Lucene analysers, so I thought other people might have seen this before over large data sets...
Background on my problem follows, but I guess my main question is: can Solr become so overwhelmed by update posts that it becomes completely unresponsive? Right now I think the problem is that the Java GC is hanging, but I've been working on this all week and it took a while to figure out that it might be GC-based and wasn't a direct result of my custom analysers, so I'd appreciate any advice anyone has about indexing large document collections.

I also have a second question for those in the know: do we have a chance of indexing/searching over our large dataset with what little hardware we already have available?

Thanks in advance :)
bec

A bit of background:
-------------------------------
I've got a large collection of articles we want to index/search over -- about 180k in total. Each article has, say, 500-1000 sentences and each sentence has about 15 fields, many of which are multi-valued; we also store most fields for display/highlighting purposes. So I'd guess over 100 million index documents. In our small test collection of 700 articles this results in a single index of about 13GB.

Our pipeline processes PDF files through to Solr native XML, which we call "index.xml" files, i.e. in <add><doc>... format ready to post straight to Solr's update handler. We create the index.xml files as we pull in information from a few sources, and creating these files from their original PDF form is farmed out across a grid and is quite time-consuming, so we distribute this process rather than creating index.xml files on the fly...

We do a lot of linguistic processing, and searching over the resulting terms requires analysers that split terms apart / join terms together, i.e. custom analysers that perform string operations and have a large overhead compared to most analysers (they take approx 20-30% more time and create twice as many short-lived objects as the "text" field type). A simplified sketch of the kind of filter I mean is included after the first question below.

Right now I'm working on my new iMac:
quad-core 2.8 GHz Intel Core i7
16 GB 1067 MHz DDR3 RAM
2TB hard drive (about half free)
OS X 10.6.4

Production environment: 2 Linux boxes, each with:
8-core Intel(R) Xeon(R) CPU @ 2.00GHz
16GB RAM

I use Java 1.6 and Solr version 1.4.1 with multi-cores (a single core right now). I set up Solr to use autocommit as we'll have several document collections / post to Solr from different data sets:

<!-- autocommit pending docs if certain criteria are met. Future versions may expand the available criteria -->
<autoCommit>
  <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
  <maxTime>900000</maxTime> <!-- every 15 minutes -->
</autoCommit>

I also have:

<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>1024</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>

-----------------
*** First question:

Has anyone else found that Solr hangs/becomes unresponsive after too many documents are indexed at once, i.e. Solr can't keep up with the post rate? I've got LCF crawling my local test set (only a file system connection is required) and posting documents to Solr, using 6GB of RAM. As I said above, these documents are in native Solr XML format (<add><doc>...) with one file per article, so each <add> contains all the sentence-level documents for that article. With LCF I post about 2.5-3k articles (files) per hour, so roughly 2.5k * 500 / 3600 = 350 <doc>s per second -- is this post rate normal/expected?
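To give a sense of what the custom analysers do, here is a simplified sketch of a term-splitting/joining token filter. This is not my actual code -- the class name and the hyphen-splitting rule are just placeholders -- but it shows the kind of per-token string work and short-lived objects (substrings, queued tokens) involved:

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Simplified sketch only: offsets and position increments are ignored.
public final class SplitJoinFilter extends TokenFilter {

    private final TermAttribute termAtt;
    // split parts waiting to be emitted on subsequent calls
    private final LinkedList<String> pending = new LinkedList<String>();

    public SplitJoinFilter(TokenStream input) {
        super(input);
        this.termAtt = addAttribute(TermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // emit any queued split parts first
        if (!pending.isEmpty()) {
            termAtt.setTermBuffer(pending.removeFirst());
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        if (term.indexOf('-') > 0) {
            // emit the joined form now and queue the individual parts;
            // each call creates several short-lived strings
            for (String part : term.split("-")) {
                if (part.length() > 0) {
                    pending.add(part);
                }
            }
            termAtt.setTermBuffer(term.replace("-", ""));
        }
        return true;
    }
}

In the schema this sort of filter sits in the analyser chain after the tokenizer, the same way the stock filters do.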
Eventually, after about 3000 files (an hour or so), Solr starts to hang / becomes unresponsive. With JConsole/GC logging I can see that the Old Gen space is about 90% full, and the following is the end of the Solr log file, where you can see GC has been called:

------------------------------------------------------------------
3012.290: [GC Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 53349392
Max Chunk Size: 3200168
Number of Blocks: 66
Av. Block Size: 808324
Tree Height: 13
Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
3012.290: [ParNew (promotion failed): 143071K->142663K(153344K), 0.0769802 secs]3012.367: [CMS
------------------------------------------------------------------

I can replicate this with Solr using "text" field types in place of those that use my custom analysers, whereby Solr takes longer to become unresponsive (about 3 hours / 13k docs), but there is the same kind of GC message at the end of the log file, and JConsole shows that the Old Gen space was almost full and so was due for a collection sweep.

I don't use any special GC settings, but I found an article here:
http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
that suggests using particular GC settings for Solr (the kind of flags I mean are sketched at the end of this mail). I will try these, but thought someone else might be able to suggest another error source / give some GC advice?

-----------------
*** Second question:

Given the production machines available for the Solr servers, does it look like we've got enough hardware to produce reasonable query times / handle a few hundred queries per second? I planned on setting up one Solr server per machine (so two in total), each with 8GB of RAM, i.e. half of the 16GB available. We also have a third, less powerful machine that houses all our data, so I plan to set up LCF on that machine and post the files to the two Solr servers from it over the subnet. Does it sound like we might be able to achieve indexing/search over this little hardware (given around 100 million index documents, i.e. approx 50 million per Solr server)?
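P.S. In case it helps the discussion, the kind of GC settings I was planning to try look something like the following (for the example Jetty setup, with the 6GB heap I use on my test machine). The specific values are guesses on my part rather than tested recommendations from the article:

java -Xms6g -Xmx6g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log \
     -jar start.jar

The idea, as far as I understand it, is to start the concurrent old-gen collections earlier so that promotion failures like the one in my log don't force a long stop-the-world collection.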