See response below
Erick Erickson wrote:
Unfortunately, the answer is "it depends(tm)".
First question: How are you indexing things? SolrJ? post.jar?
SolrJ, CommonsHttpSolrServer
But some observations:
1> sure, using multiple cores will have some parallelism. So will
using a single core but using something like SolrJ and
StreamingUpdateSolrServer.
So SolrJ with CommonsHttpSolrServer will not support handling several
requests concurrently?
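(For what it's worth, CommonsHttpSolrServer is thread-safe, so concurrency can come from calling it from several client threads; StreamingUpdateSolrServer additionally buffers documents and sends them over its own pool of background threads. A minimal sketch, assuming a Solr core at http://localhost:8983/solr and SolrJ 3.x class names; the URL, field names, and document counts are placeholders. On trunk/4.x the class is renamed ConcurrentUpdateSolrServer:)

```java
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexing {
    public static void main(String[] args) throws Exception {
        // StreamingUpdateSolrServer queues documents internally and streams
        // them to Solr on background threads.
        // Arguments: Solr URL, internal queue size, number of sender threads.
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);

        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", String.valueOf(i));
            doc.addField("name", "document " + i);
            server.add(doc); // returns quickly; sending happens in the background
        }
        server.commit(); // blocks until the queue is drained and the commit is done
    }
}
```

(This is a sketch, not a drop-in program: it needs the SolrJ jars on the classpath and a running Solr instance.)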
Especially with trunk (4.0)
and the Document Writer Per Thread stuff.
We are using trunk (4.0). Can you provide me with a little more info on
this "Document Writer Per Thread" stuff? A link or something?
In 3.x, you'll
see some pauses when segments are merged that you
can't get around (per core). See:
http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
for an excellent writeup. But whether or not you use several
cores should be determined by your problem space, certainly
not by trying to increase the throughput. Indexing usually
takes a back seat to search performance.
We will have few searches, but a lot of indexing.
2> general settings are hard to come by. If you're sending
structured documents that use Tika to parse the data
behind the scenes, your performance will be much
different (slower) than sending SolrInputDocuments
(SolrJ).
We are sending SolrInputDocuments
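(One thing that often helps in this situation: SolrJ's add() also accepts a collection of documents, so you can batch many SolrInputDocuments into one HTTP request instead of one request per document. A hedged sketch; the URL, batch size, and field names are assumptions, not anything from this thread:)

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexing {
    public static void main(String[] args) throws Exception {
        // Placeholder URL for your Solr core.
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", String.valueOf(i));
            batch.add(doc);
            if (batch.size() == 100) { // send 100 docs per HTTP request
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch); // flush the remainder
        }
        server.commit();
    }
}
```

(Batching amortizes the per-request HTTP overhead, which matters a lot more for indexing-heavy workloads like yours than for search-heavy ones.)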
3> The recommended servlet container is, generally,
"The one you're most comfortable with". Tomcat is
certainly popular. That said, use whatever you're
most comfortable with until you see a performance
problem. Odds are you'll find your load on Solr is a
at its limit before your servlet container has problems.
So Jetty is not an "easy to use, but non-performant" container?
4> Monitor your CPU, fire more requests at it until it
hits 100%. Note that there are occasions where the
servlet container limits the number of outstanding
requests it will allow and queues ones over that
limit (find the magic setting to increase this if it's a
problem, it differs by container). If you start to see
your response times lengthen but the CPU not being
fully utilized, that may be the cause.
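(For Jetty, the "magic setting" Erick mentions is the thread pool in etc/jetty.xml. A sketch of what that section can look like; the class name depends on the Jetty version bundled with your Solr, e.g. org.mortbay.thread.QueuedThreadPool in Jetty 6 vs. org.eclipse.jetty.util.thread.QueuedThreadPool in later versions, and the numbers here are illustrative, not recommendations:)

```xml
<!-- In etc/jetty.xml, inside the <Configure> element for the Server -->
<Set name="ThreadPool">
  <New class="org.mortbay.thread.QueuedThreadPool">
    <Set name="minThreads">10</Set>
    <!-- Raise maxThreads if requests are being queued while CPU is idle -->
    <Set name="maxThreads">200</Set>
  </New>
</Set>
```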
Actually, right now I am trying to find out what my bottleneck is. The
setup is more complex than I want to bother you with, but basically I
have servers with 80-90% IO-wait and only 5-10% "real" CPU usage. It
might not be a Solr-related problem; I am investigating different
things, but I just wanted to know a little more about how Jetty/Solr work
in order to make a qualified guess.
5> How high is "high performance"? On a stock solr
with the Wikipedia dump (11M docs), all running on
my laptop, I see 7K docs/sec indexed. I know of
installations that see 60 docs/sec or even less. I'm
sending simple docs with SolrJ locally and they're
sending huge documents over the wire that Tika
handles. There are just so many variables it's hard
to say anything except "try it and see"......
Well, eventually we need to be able to index and delete about 50 million
documents per day. We will need to keep a "history" of 2 years of data
in our system, so deletion will not start until we have been in production
for 2 years. At that point the system needs to contain 2 years *
365 days/year * 50 million docs/day = 36.5 billion documents, and
50 million documents will need to be both indexed and deleted per day;
before that we only need to index 50 million documents per day. We are
aware that we are probably going to need a certain amount of hardware for
this, but the most important thing is that we make a scalable setup so
that we can reach these kinds of numbers at all. Right now I am focusing
on getting the most out of one Solr instance, potentially with several
cores, though.
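(As a sanity check on those numbers, the totals and sustained rates can be computed directly; this is plain arithmetic, no Solr involved:)

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        long docsPerDay = 50000000L;                 // 50 million documents per day
        long totalDocs = docsPerDay * 365L * 2L;     // 2 years of retained history
        double docsPerSecond = docsPerDay / 86400.0; // sustained average rate

        System.out.println("total docs after 2 years: " + totalDocs);
        System.out.printf("sustained indexing rate: %.1f docs/sec%n", docsPerSecond);
    }
}
```

(That works out to 36,500,000,000 documents retained and roughly 579 docs/sec as a sustained average, so with simple SolrInputDocuments the per-node indexing rate looks less daunting than the total index size.)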
Best
Erick
On Fri, Feb 3, 2012 at 3:55 AM, Per Steffensen <st...@designware.dk> wrote:
Hi
This topic has probably been covered before, but I haven't had the luck to
find the answer.
We are running Solr instances with several cores inside, with Solr running
out-of-the-box on top of Jetty. I believe Jetty receives all the
http-requests about indexing new documents and forwards them to the Solr
engine. What kind of parallelism does this setup provide? Can more than one
index-request get processed concurrently? How many? How do I increase the
number of index-requests that can be handled in parallel? Will I get better
parallelism by running on another web container than Jetty, e.g. Tomcat?
What is the recommended web container for high performance production
systems?
Thanks!
Regards, Per Steffensen