Hi,

When indexing large amounts of data I hit a problem whereby Solr becomes
unresponsive and doesn't recover (even when left overnight!). I think I've hit
some GC problems / some GC tuning is required, and I wanted to know if anyone
has hit this problem before. I can replicate the error (albeit taking longer to
do so) using only the standard Solr/Lucene analysers, so I thought other people
might have hit this issue before over large data sets...

Background on my problem follows -- but I guess my main question is -- can Solr
become so overwhelmed by update posts that it becomes completely unresponsive??

Right now I think the problem is that the Java GC is hanging, but I've been
working on this all week and it took a while to figure out that it might be
GC-related rather than a direct result of my custom analysers, so I'd
appreciate any advice anyone has about indexing large document collections.

I also have a second question for those in the know -- do we have a chance of
indexing/searching over our large dataset with the little hardware we already
have available?

thanks in advance :)

bec

a bit of background: -------------------------------

I've got a large collection of articles we want to index/search over -- about
180k in total. Each article has say 500-1000 sentences, and each sentence
becomes an index document with about 15 fields, many of which are multi-valued;
we also store most fields for display/highlighting purposes. So I'd guess over
100 million index documents (180k articles x 500-1000 sentences each).

In our small test collection of 700 articles this results in a single index of
about 13GB.

Our pipeline processes PDF files through to Solr native XML, which we call
"index.xml" files, i.e. in <add><doc>... format, ready to post straight to
Solr's update handler.

We create the index.xml files as we pull in information from a few sources,
and creating these files from their original PDF form is farmed out across a
grid and is quite time-consuming, so we distribute this process rather than
creating index.xml files on the fly...
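
For reference, a minimal sketch of what one of these index.xml files looks
like (the field names below are illustrative placeholders, not our real schema):

<add>
  <doc>
    <field name="id">article123_s0001</field>
    <field name="article_id">article123</field>
    <field name="sentence_text">The quick brown fox ...</field>
    <!-- ~15 fields per sentence document, many multi-valued -->
  </doc>
  <!-- one <doc> per sentence; all sentences for an article sit in a single <add> -->
</add>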

We do a lot of linguistic processing, and enabling search over the resulting
terms requires analysers that split terms / join terms together, i.e. custom
analysers that perform string operations and have a large overhead compared to
most analysers (they take approx. 20-30% more time and create twice as many
short-lived objects as the "text" field type).

Right now I'm working on my new iMac:
quad-core 2.8 GHz Intel Core i7
16 GB 1067 MHz DDR3 RAM
2TB hard drive (about half free)
Mac OS X 10.6.4

Production environment:
2 linux boxes each with:
8-core Intel(R) Xeon(R) CPU @ 2.00GHz
16GB RAM

I use Java 1.6 and Solr 1.4.1 with multiple cores (a single core right now).

I set up Solr to use autocommit, as we'll have several document collections
and will post to Solr from different data sets:

    <!-- autocommit pending docs if certain criteria are met.
         Future versions may expand the available criteria -->
    <autoCommit>
      <maxDocs>500000</maxDocs> <!-- every 1000 articles -->
      <maxTime>900000</maxTime> <!-- every 15 minutes -->
    </autoCommit>

I also have:
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
-----------------

*** First question:
Has anyone else found that Solr hangs/becomes unresponsive after too
many documents are indexed at once i.e. Solr can't keep up with the post rate?

I've got LCF crawling my local test set (only a file system connection is
required) and posting documents to Solr, using 6GB of RAM. As I said above,
these documents are in native Solr XML format (<add><doc>...), with one file
per article, so each <add> contains all the sentence-level documents for the
article.

With LCF I post about 2.5-3k articles (files) per hour -- so roughly
2.5k * 500 / 3600 ≈ 350 <doc>s per second post rate -- is this normal/expected?

Eventually, after about 3000 files (an hour or so), Solr starts to hang /
becomes unresponsive, and with JConsole/GC logging I can see that the Old Gen
space is about 90% full. The following is the end of the Solr log file, where
you can see GC has been called:
------------------------------------------------------------------
3012.290: [GC Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 53349392
Max   Chunk Size: 3200168
Number of Blocks: 66
Av.  Block  Size: 808324
Tree      Height: 13
Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 0
Max   Chunk Size: 0
Number of Blocks: 0
Tree      Height: 0
3012.290: [ParNew (promotion failed): 143071K->142663K(153344K),
0.0769802 secs]3012.367: [CMS
------------------------------------------------------------------

I can replicate this with Solr using the "text" field type in place of the
fields that use my custom analysers -- Solr takes longer to become unresponsive
(about 3 hours / 13k docs), but there is the same kind of GC message at the end
of the log file, and JConsole shows that the Old Gen space was almost full and
due for a collection sweep.

I don't use any special GC settings, but I found an article here:
http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/

that suggests using particular GC settings for Solr. I will try these, but I
thought someone else might be able to suggest another error source or give
some GC advice?
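
To make the question concrete, this is roughly what I intend to try -- a
sketch only, with heap sizes guessed for my 6GB setup rather than taken
verbatim from the article (start.jar here is just the example Jetty launcher):

java -Xms6g -Xmx6g \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -jar start.jar

If anyone has found better values for a heavy indexing workload I'd be glad to
hear them.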

-----------------

*** Second question:

Given the production machines available for the Solr servers, does it look
like we've got enough hardware to produce reasonable query times / handle a
few hundred queries per second?

I planned on setting up one Solr server per machine (so two in total),
each with 8GB
of RAM -- so half of the 16GB available.

We also have a third, less powerful machine that houses all our data, so I
plan to set up LCF on that machine and post the files to the two Solr servers
from it over the subnet.

Does it sound like we might be able to achieve indexing/search with this
little hardware (given around 100 million index documents, i.e. approx.
50 million per Solr server)?
