I am a little confused - how did 180k documents become 100m index
documents?

We have over 20 indices (for different content sets): one with 5m
documents (about a couple of pages each) and another with 100k+ docs.
We can index the 5m collection in a couple of days (the limitation is
in the source), which works out to roughly 100k documents an hour
without breaking a sweat.



On 8/12/10, Rebecca Watson <bec.wat...@gmail.com> wrote:
> Hi,
>
> When indexing large amounts of data I hit a problem whereby Solr
> becomes unresponsive and doesn't recover (even when left overnight!).
> I think I've hit some GC problems / some GC tuning is required, and I
> wanted to know if anyone has ever hit this problem. I can replicate
> the error (albeit taking longer to do so) using only the stock
> Solr/Lucene analysers, so I thought other people might have hit this
> issue before over large data sets....
>
> Background on my problem follows -- but I guess my main question is:
> can Solr become so overwhelmed by update posts that it becomes
> completely unresponsive??
>
> Right now I think the problem is that the Java GC is hanging. I've
> been working on this all week, and it took a while to figure out that
> it might be GC-based rather than a direct result of my custom
> analysers, so I'd appreciate any advice anyone has about indexing
> large document collections.
>
> I also have a second question for those in the know -- do we have a
> chance of indexing/searching over our large dataset with the little
> hardware we already have available??
>
> thanks in advance :)
>
> bec
>
> a bit of background: -------------------------------
>
> I've got a large collection of articles we want to index/search over
> -- about 180k in total. We index one Solr document per sentence: each
> article has say 500-1000 sentences, and each sentence has about 15
> fields, many of which are multi-valued; we also store most fields for
> display/highlighting purposes. So I'd guess over 100 million index
> documents (180k articles x 500-1000 sentences each).
>
> In our small test collection of 700 articles this results in a
> single index of about 13GB.
>
> Our pipeline processes PDF files through to Solr native XML, which
> we call "index.xml" files, i.e. in <add><doc>... format ready to post
> straight to Solr's update handler.
>
> We create the index.xml files as we pull in information from a few
> sources, and creating these files from their original PDF form is
> farmed out across a grid and is quite time-consuming, so we
> distribute this process rather than creating index.xml files on the
> fly...
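>
> For reference, each index.xml file looks roughly like this (the field
> names here are made up for illustration, not our real schema):
>
>   <add>
>     <doc>
>       <field name="id">article0001_s0001</field>
>       <field name="article_id">article0001</field>
>       <field name="sentence_text">First sentence of the article...</field>
>       <field name="term">...</field>   <!-- multi-valued -->
>       <field name="term">...</field>
>     </doc>
>     <doc>
>       <field name="id">article0001_s0002</field>
>       ...
>     </doc>
>   </add>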
>
> We do a lot of linguistic processing, and enabling search over the
> resulting terms requires analysers that split terms / join terms
> together, i.e. custom analysers that perform string operations and
> are quite time-consuming / have a large overhead compared to most
> analysers (they take approx 20-30% more time and use twice as many
> short-lived objects as the "text" field type).
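>
> To give an idea of where this sits, an analysis chain of this kind is
> wired into schema.xml roughly as follows (just a sketch -- the
> com.example factory is a made-up stand-in for our custom filters, and
> the stock tokenizer/filters shown are only examples):
>
>   <fieldType name="text_ling" class="solr.TextField"
>              positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.WordDelimiterFilterFactory"
>               generateWordParts="1" catenateWords="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <!-- hypothetical custom filter that splits/joins terms -->
>       <filter class="com.example.TermSplitJoinFilterFactory"/>
>     </analyzer>
>   </fieldType>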
>
> Right now I'm working on my new iMac:
> quad-core 2.8 GHz Intel Core i7
> 16 GB 1067 MHz DDR3 RAM
> 2TB hard drive (about half free)
> Mac OS X version 10.6.4
>
> Production environment:
> 2 linux boxes each with:
> 8-core Intel(R) Xeon(R) CPU @ 2.00GHz
> 16GB RAM
>
> I use Java 1.6 and Solr version 1.4.1 with multi-core support (a
> single core right now).
>
> I set up Solr to use autocommit, as we'll have several document
> collections / post to Solr from different data sets:
>
>     <!-- autocommit pending docs if certain criteria are met.
>          Future versions may expand the available criteria -->
>     <autoCommit>
>       <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
>       <maxTime>900000</maxTime> <!-- every 15 minutes -->
>     </autoCommit>
>
> I also have:
>
>     <useCompoundFile>false</useCompoundFile>
>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>     <mergeFactor>10</mergeFactor>
> -----------------
>
> *** First question:
> Has anyone else found that Solr hangs/becomes unresponsive after too
> many documents are indexed at once, i.e. Solr can't keep up with the
> post rate?
>
> I've got LCF crawling my local test set (only a file-system
> connection is required) and posting documents to Solr using 6GB of
> RAM. As I said above, these documents are in native Solr XML format
> (<add><doc>....) with one file per article, so each <add> contains
> all the sentence-level documents for that article.
>
> With LCF I post about 2.5-3k articles (files) per hour -- so about
> 2.5k * 500 / 3600 = ~350 <doc>s per second post rate -- is this
> normal/expected??
>
> Eventually, after about 3000 files (an hour or so), Solr starts to
> hang/becomes unresponsive, and with JConsole/GC logging I can see
> that the old-gen space is about 90% full. The following is the end of
> the Solr log file -- where you can see GC has been called:
> ------------------------------------------------------------------
> 3012.290: [GC Before GC:
> Statistics for BinaryTreeDictionary:
> ------------------------------------
> Total Free Space: 53349392
> Max   Chunk Size: 3200168
> Number of Blocks: 66
> Av.  Block  Size: 808324
> Tree      Height: 13
> Before GC:
> Statistics for BinaryTreeDictionary:
> ------------------------------------
> Total Free Space: 0
> Max   Chunk Size: 0
> Number of Blocks: 0
> Tree      Height: 0
> 3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
> 0.0769802 secs]3012.367: [CMS
> ------------------------------------------------------------------
>
> I can replicate this with Solr using "text" field types in place of
> those that use my custom analysers -- Solr takes longer to become
> unresponsive (about 3 hours / 13k docs), but there is the same kind
> of GC message at the end of the log file, and JConsole shows that the
> old-gen space was almost full and so was due for a collection sweep.
>
> I don't use any special GC settings, but I found an article here:
> http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
> that suggests particular GC settings for Solr -- I will try these,
> but I thought someone else might be able to suggest another error
> source / give some GC advice??
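>
> For what it's worth, the sort of GC/logging flags I'm planning to
> experiment with look something like this (assuming the example Jetty
> start.jar; the heap size and CMS threshold are guesses I still need
> to tune, not recommendations from that article):
>
>   java -Xms6g -Xmx6g \
>        -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
>        -XX:CMSInitiatingOccupancyFraction=70 \
>        -XX:+UseCMSInitiatingOccupancyOnly \
>        -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
>        -Xloggc:gc.log \
>        -jar start.jar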
>
> -----------------
>
> *** Second question:
>
> Given the production machines available for the Solr servers, does it
> look like we've got enough hardware to produce reasonable query times
> / handle a few hundred queries per second??
>
> I planned on setting up one Solr server per machine (so two in
> total), each with 8GB of RAM -- so half of the 16GB available.
>
> We also have a third, less powerful machine that houses all our
> data, so I plan to set up LCF on that machine and post the files to
> the two Solr servers from it over the subnet.
>
> Does it sound like we might be able to achieve indexing/search with
> this little hardware (given around 100 million index documents, i.e.
> approx 50 million per Solr server)?
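>
> (If we do split the collection roughly in half like that, I assume at
> query time we'd just use Solr's distributed search and list both
> servers in the shards parameter, something like the following -- the
> host names are placeholders:
>
>   http://solr1:8983/solr/select?q=...
>       &shards=solr1:8983/solr,solr2:8983/solr
>
> so each query fans out to both indexes.)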
>

-- 
Sent from my mobile device
