I am a little confused - how did 180k documents become 100m index documents? We have over 20 indices (for different content sets), one with 5m documents (about a couple of pages each) and another with 100k+ docs. We can index the 5m collection in a couple of days (the limitation is in the source), which works out to about 100k documents an hour without breaking a sweat.
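
Back-of-the-envelope on that rate, assuming "a couple of days" is roughly 48 hours of wall-clock time (nothing Solr-specific, just the arithmetic):

// Sanity check of the ~100k docs/hour figure above.
public class RateCheck {
    public static void main(String[] args) {
        long docs = 5000000L;  // documents in the 5m collection
        double hours = 48.0;   // "a couple of days", assumed ~48h
        System.out.println(docs / hours + " docs/hour");  // ~104,166 docs/hour
    }
}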
On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
> Hi,
>
> When indexing large amounts of data I hit a problem whereby Solr becomes
> unresponsive and doesn't recover (even when left overnight!). I think I've
> hit some GC problems / some GC tuning is required, and I wanted to know if
> anyone has ever hit this problem. I can replicate this error (albeit taking
> longer to do so) using Solr/Lucene analysers only, so I thought other people
> might have hit this issue before over large data sets...
>
> Background on my problem follows -- but I guess my main question is -- can
> Solr become so overwhelmed by update posts that it becomes completely
> unresponsive??
>
> Right now I think the problem is that the Java GC is hanging, but I've been
> working on this all week and it took a while to figure out it might be
> GC-based / wasn't a direct result of my custom analysers, so I'd appreciate
> any advice anyone has about indexing large document collections.
>
> I also have a second question for those in the know -- do we have a chance
> of indexing/searching over our large dataset with what little hardware we
> already have available??
>
> thanks in advance :)
>
> bec
>
> a bit of background:
> -------------------------------
>
> I've got a large collection of articles we want to index/search over --
> about 180k in total. Each article has say 500-1000 sentences, and each
> sentence has about 15 fields, many of which are multi-valued, and we store
> most fields as well for display/highlighting purposes. So I'd guess over
> 100 million index documents.
>
> In our small test collection of 700 articles this results in a single index
> of about 13GB.
>
> Our pipeline processes PDF files through to Solr native XML which we call
> "index.xml" files, i.e. in <add><doc>... format ready to post straight to
> Solr's update handler.
>
> We create the index.xml files as we pull in information from a few sources,
> and creation of these files from their original PDF form is farmed out
> across a grid and is quite time-consuming, so we distribute this process
> rather than creating index.xml files on the fly...
>
> We do a lot of linguistic processing, and enabling search over the resulting
> terms requires analysers that split terms / join terms together, i.e. custom
> analysers that perform string operations and are quite time-consuming / have
> large overhead compared to most analysers (they take approx 20-30% more time
> and use twice as many short-lived objects as the "text" field type).
>
> Right now I'm working on my new iMac:
> quad-core 2.8 GHz Intel Core i7
> 16 GB 1067 MHz DDR3 RAM
> 2TB hard drive (about half free)
> OS X 10.6.4
>
> Production environment:
> 2 Linux boxes, each with:
> 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
> 16GB RAM
>
> I use Java 1.6 and Solr version 1.4.1 with multi-cores (a single core right
> now).
>
> I set up Solr to use autocommit as we'll have several document collections /
> post to Solr from different data sets:
>
> <!-- autocommit pending docs if certain criteria are met.  Future versions
>      may expand the available criteria -->
> <autoCommit>
>   <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
>   <maxTime>900000</maxTime> <!-- every 15 minutes -->
> </autoCommit>
>
> I also have
> <useCompoundFile>false</useCompoundFile>
> <ramBufferSizeMB>1024</ramBufferSizeMB>
> <mergeFactor>10</mergeFactor>
>
> -----------------
>
> *** First question:
>
> Has anyone else found that Solr hangs/becomes unresponsive after too many
> documents are indexed at once, i.e. Solr can't keep up with the post rate?
>
> I've got LCF crawling my local test set (file system connection required
> only) and posting documents to Solr using 6GB of RAM. As I said above, these
> documents are in native Solr XML format (<add><doc>...) with one file per
> article, so each <add> contains all the sentence-level documents for the
> article.
>
> With LCF I post about 2.5/3k articles (files) per hour -- so about
> 2.5k*500/3600 = 350 <doc>s per second post-rate -- is this normal/expected??
>
> Eventually, after about 3000 files (an hour or so), Solr starts to hang /
> becomes unresponsive, and with JConsole/GC logging I can see that the
> Old-Gen space is about 90% full. The following is the end of the Solr log
> file -- where you can see GC has been called:
>
> ------------------------------------------------------------------
> 3012.290: [GC Before GC:
> Statistics for BinaryTreeDictionary:
> ------------------------------------
> Total Free Space: 53349392
> Max Chunk Size: 3200168
> Number of Blocks: 66
> Av. Block Size: 808324
> Tree Height: 13
> Before GC:
> Statistics for BinaryTreeDictionary:
> ------------------------------------
> Total Free Space: 0
> Max Chunk Size: 0
> Number of Blocks: 0
> Tree Height: 0
> 3012.290: [ParNew (promotion failed): 143071K->142663K(153344K), 0.0769802 secs]3012.367: [CMS
> ------------------------------------------------------------------
>
> I can replicate this with Solr using "text" field types in place of those
> that use my custom analysers -- whereby Solr takes longer to become
> unresponsive (about 3 hours / 13k docs), but there is the same kind of GC
> message at the end of the log file, and JConsole shows that the Old-Gen
> space was almost full, so it was due for a collection sweep.
>
> I don't use any special GC settings, but found an article here:
> http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
> that suggests using particular GC settings for Solr -- I will try these, but
> thought someone else could suggest another error source / give some GC
> advice??
>
> -----------------
>
> *** Second question:
>
> Given the production machines available for the Solr servers, does it look
> like we've got enough hardware to produce reasonable query times / handle a
> few hundred queries per second??
>
> I planned on setting up one Solr server per machine (so two in total), each
> with 8GB of RAM -- so half of the 16GB available.
>
> We also have a third, less powerful machine that houses all our data, so I
> plan to set up LCF on that machine + post the files to the two Solr servers
> from this machine in the subnet.
>
> Does it sound like we might be able to achieve indexing/search over this
> little hardware (given around 100 million index documents, i.e. approx 50
> million per Solr server)?
>
> --
> Sent from my mobile device
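
Re the first question (Solr not keeping up with the post rate): one thing that's cheap to try is driving the update handler yourself, so you can throttle between batches and commit explicitly rather than relying only on autocommit while LCF pushes as fast as it can. A rough, untested sketch -- the URL, directory, batch size and pause below are placeholder assumptions, not your actual setup:

// Post the pre-built index.xml files (<add><doc>...) to Solr's XML update
// handler ourselves, committing every N articles and pausing in between.
// UPDATE_URL, the input directory, the batch size and the sleep are all
// illustrative placeholders.
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ThrottledPoster {

    private static final String UPDATE_URL = "http://localhost:8983/solr/update";

    public static void main(String[] args) throws Exception {
        File[] files = new File("/path/to/index-xml-files").listFiles();
        if (files == null) {
            throw new IllegalStateException("input directory not found");
        }
        int posted = 0;
        for (File f : files) {
            if (!f.getName().endsWith(".xml")) {
                continue;
            }
            post(new FileInputStream(f));  // one <add> file = one article
            posted++;
            if (posted % 1000 == 0) {      // mirrors the "every 1000 articles" comment above
                post(new ByteArrayInputStream("<commit/>".getBytes("UTF-8")));
                Thread.sleep(5000);        // arbitrary pause so Solr/GC can catch up
            }
        }
        post(new ByteArrayInputStream("<commit/>".getBytes("UTF-8")));  // final commit
    }

    // POST an XML body to the update handler, failing loudly on a non-200 response.
    private static void post(InputStream body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(UPDATE_URL).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = body.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        out.close();
        body.close();
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("Solr returned HTTP " + conn.getResponseCode());
        }
        conn.getInputStream().close();
        conn.disconnect();
    }
}

The commit-every-1000-articles simply mirrors the maxDocs comment in your config; the pause length is just a knob to experiment with so the old generation has a chance to be collected before the next burst of <doc>s arrives.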