Internally, the docs are batched up into smaller buckets (10 docs each, as I remember) and forwarded to the correct shard leader. I suspect that's what you're seeing.
Erick

On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> Regarding batch indexing:
> When I send batches of 1000 docs to a standalone Solr server, the log file
> reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the
> leader of a replicated index, the leader log file reports much smaller
> numbers, usually "(12 adds)". Why do the batches appear to be broken up?
>
> Peter
>
> On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> NP, just making sure.
>>
>> I suspect you'll get lots more bang for the buck, and results much more
>> closely matching your expectations, if:
>>
>> 1> you batch up a bunch of docs at once rather than sending them one at
>> a time. That's probably the easiest thing to try. Sending docs one at a
>> time is something of an anti-pattern. I usually start with batches of
>> 1,000.
>>
>> And just to check... you're not issuing any commits from the client,
>> right? Performance will be terrible if you issue commits after every
>> doc; that's totally an anti-pattern. Doubly so for optimizes. Since you
>> showed us your solrconfig autocommit settings I'm assuming not, but I
>> want to be sure.
>>
>> 2> use a leader-aware client. I'm totally unfamiliar with Go, so I have
>> no suggestions whatsoever to offer there, but you'll want to batch in
>> this case too.
>>
>> On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose <ianr...@fullstory.com> wrote:
>> > Hi Erick -
>> >
>> > Thanks for the detailed response, and apologies for my confusing
>> > terminology. I should have said "WPS" (writes per second) instead of
>> > QPS, but I didn't want to introduce a weird new acronym since QPS is
>> > well known. Clearly a bad decision on my part. To clarify: I am doing
>> > *only* writes (document adds). Whenever I wrote "QPS" I was referring
>> > to writes.
>> >
>> > It seems clear at this point that I should wrap up the code to do
>> > "smart" routing rather than choosing Solr nodes randomly, and then see
>> > if that changes things. I must admit that although I understand that
>> > random node selection will impose a performance hit, theoretically it
>> > seems to me that the system should still scale up as you add more
>> > nodes (albeit at a lower absolute level of performance than with a
>> > smart router). Nonetheless, I'm just theorycrafting here, so the
>> > better thing to do is just try it experimentally. I hope to have that
>> > working today - will report back on my findings.
>> >
>> > Cheers,
>> > - Ian
>> >
>> > p.s. To clarify why we are rolling our own smart router code: we use
>> > Go over here rather than Java. Although if we still get bad
>> > performance with our custom Go router, I may try a pure Java load
>> > client using CloudSolrServer to eliminate the possibility of bugs in
>> > our implementation.
>> >
>> > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson
>> > <erickerick...@gmail.com> wrote:
>> >
>> >> I'm really confused:
>> >>
>> >> bq: I am not issuing any queries, only writes (document inserts)
>> >>
>> >> bq: It's clear that once the load test client has ~40 simulated users
>> >>
>> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support
>> >> a higher QPS than 2 shards over 2 Solr nodes, right
>> >>
>> >> QPS is usually used to mean "Queries Per Second", which is hard to
>> >> square with the statement "I am not issuing any queries...". And what
>> >> do the number of users have to do with inserting documents?
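[Illustrative aside, not part of the original thread: the batching and leader-aware routing suggested above might look like the following sketch in SolrJ, assuming the Solr 4.x CloudSolrServer API. The ZooKeeper addresses, collection name, field names, and document counts are placeholders.]

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // CloudSolrServer watches cluster state in ZooKeeper and routes each
    // document to its shard leader, avoiding the extra network hop that
    // random node selection incurs.
    CloudSolrServer server =
        new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");

    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < 100000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("title_t", "document number " + i);
      batch.add(doc);
      if (batch.size() == 1000) {   // send 1,000 docs per add request
        server.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);            // flush the remainder
    }
    // Note: no commit issued from the client; visibility is left to the
    // autoCommit / autoSoftCommit settings discussed later in the thread.
    server.shutdown();
  }
}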
>> >>
>> >> You also state: "In many cases, CPU on the solr servers is quite low
>> >> as well"
>> >>
>> >> So let's talk about indexing first. Indexing should scale nearly
>> >> linearly as long as:
>> >> 1> you are routing your docs to the correct leader, which happens
>> >> automatically with SolrJ and CloudSolrServer. Rather than rolling
>> >> your own, I strongly suggest you try this out.
>> >> 2> you have enough clients feeding the cluster to push CPU
>> >> utilization on all the nodes. Very often "slow indexing", or in your
>> >> case "lack of scaling", is really a document-acquisition problem:
>> >> your doc generator is spending all its time waiting for individual
>> >> documents to get to Solr and come back.
>> >>
>> >> bq: "chooses a random solr server for each ADD request (with 1 doc
>> >> per add request)"
>> >>
>> >> Probably your culprit right there. Each and every document has to
>> >> cross the network (and then be forwarded to the correct leader). So,
>> >> given that you're not seeing high CPU utilization, I suspect that
>> >> you're not sending docs to SolrCloud fast enough to see scaling. You
>> >> need to batch up multiple docs; I generally send 1,000 docs at a time.
>> >>
>> >> But even if you do solve this, the inter-node routing will prevent
>> >> linear scaling. When a doc (or a batch of docs) goes to a random Solr
>> >> node, here's what happens:
>> >> 1> the docs are re-packaged into groups based on which shard they're
>> >> destined for
>> >> 2> the sub-packets are forwarded to the leader for each shard
>> >> 3> the responses are gathered back and returned to the client
>> >>
>> >> This set of operations will eventually degrade the scaling.
>> >>
>> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support
>> >> a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole
>> >> idea behind sharding.
>> >>
>> >> If we're talking search requests, the answer is no. Sharding is what
>> >> you do when your collection no longer fits on a single node. If it
>> >> _does_ fit on a single node, then you'll usually get better query
>> >> performance by adding a bunch of replicas to a single shard. When the
>> >> number of docs on each shard grows large enough that you no longer
>> >> get good query performance, _then_ you shard, and take the query hit.
>> >>
>> >> If we're talking about inserts, then see above. I suspect your
>> >> problem is that you're _not_ "saturating the SolrCloud cluster";
>> >> you're sending docs to Solr very inefficiently and waiting on I/O.
>> >> Batching docs and sending them to the right leader should scale
>> >> pretty linearly until you start saturating your network.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose <ianr...@fullstory.com> wrote:
>> >> > Thanks for the suggestions so far, all.
>> >> >
>> >> > 1) We are not using SolrJ on the client (we're not using Java at
>> >> > all), but I am working on writing a "smart" router so that we can
>> >> > always send to the correct node. I am certainly curious to see how
>> >> > that changes things. Nonetheless, even with the overhead of extra
>> >> > routing hops, the observed behavior (no increase in performance
>> >> > with more nodes) doesn't make any sense to me.
>> >> >
>> >> > 2) Commits: we are using autoCommit with openSearcher=false
>> >> > (maxTime=60000) and autoSoftCommit (maxTime=15000).
>> >> >
>> >> > 3) Suggestions to batch documents certainly make sense for
>> >> > production code, but in this case I am not really concerned with
>> >> > absolute performance; I just want to see the *relative* performance
>> >> > as we use more Solr nodes. So I don't think batching or not really
>> >> > matters.
>> >> >
>> >> > 4) "No, that won't affect indexing speed all that much. The way to
>> >> > increase indexing speed is to increase the number of processes or
>> >> > threads that are indexing at the same time. Instead of having one
>> >> > client sending update requests, try five of them."
>> >> >
>> >> > Can you elaborate on this some? I'm worried I might be
>> >> > misunderstanding something fundamental. A cluster of 3 shards over
>> >> > 3 Solr nodes *should* support a higher QPS than 2 shards over 2
>> >> > Solr nodes, right? That's the whole idea behind sharding. Regarding
>> >> > your comment to "increase the number of processes or threads": note
>> >> > that for each value of K (number of Solr nodes) I measured with
>> >> > several different numbers of simulated users so that I could find a
>> >> > "saturation point". For example, take a look at my data for K=2:
>> >> >
>> >> > Num Nodes | Num Users | QPS
>> >> > 2         | 1         | 472
>> >> > 2         | 5         | 1790
>> >> > 2         | 10        | 2290
>> >> > 2         | 15        | 2850
>> >> > 2         | 20        | 2900
>> >> > 2         | 40        | 3210
>> >> > 2         | 60        | 3200
>> >> > 2         | 80        | 3210
>> >> > 2         | 100       | 3180
>> >> >
>> >> > It's clear that once the load test client has ~40 simulated users,
>> >> > the Solr cluster is saturated. Creating more users just increases
>> >> > the average request latency, such that the total QPS remains
>> >> > (nearly) constant. So I feel pretty confident that a cluster of
>> >> > size 2 *maxes out* at ~3200 QPS. The problem is that I am finding
>> >> > roughly this same "max point" for every value of K (> 1), no matter
>> >> > how many simulated users the load test client creates.
>> >> >
>> >> > Cheers,
>> >> > - Ian
>> >> >
>> >> > On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson
>> >> > <erickerick...@gmail.com> wrote:
>> >> >
>> >> >> Your indexing client, if written in SolrJ, should use
>> >> >> CloudSolrServer, which is, in Matt's terms, "leader aware". It
>> >> >> divides the documents to be indexed into packets where each doc
>> >> >> in the packet belongs on the same shard, and then sends each
>> >> >> packet to the shard leader. This avoids a lot of re-routing and
>> >> >> should scale essentially linearly. You may have to add more
>> >> >> clients, though, depending upon how hard the document generator
>> >> >> is working.
>> >> >>
>> >> >> Also, make sure that you send batches of documents, as Shawn
>> >> >> suggests; I use 1,000 as a starting point.
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>> >> >> > On 10/30/2014 2:56 PM, Ian Rose wrote:
>> >> >> >> I think this is true only for actual queries, right? I am not
>> >> >> >> issuing any queries, only writes (document inserts). In the
>> >> >> >> case of writes, increasing the number of shards should increase
>> >> >> >> my throughput (in ops/sec) more or less linearly, right?
>> >> >> >
>> >> >> > No, that won't affect indexing speed all that much. The way to
>> >> >> > increase indexing speed is to increase the number of processes
>> >> >> > or threads that are indexing at the same time. Instead of having
>> >> >> > one client sending update requests, try five of them.
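[Another illustrative aside, not part of the original thread: one way to run Shawn's "five of them" from a single JVM is a small thread pool, sketched below against the Solr 4.x SolrJ API. The thread count, document counts, and id scheme are placeholders; a single CloudSolrServer instance is shared across threads, which SolrJ allows.]

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
  public static void main(String[] args) throws Exception {
    // One CloudSolrServer shared by all worker threads; it is thread-safe.
    final CloudSolrServer server =
        new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");

    // Five concurrent "clients", each sending batched update requests.
    ExecutorService pool = Executors.newFixedThreadPool(5);
    for (int t = 0; t < 5; t++) {
      final int clientId = t;
      pool.submit(new Runnable() {
        public void run() {
          try {
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 20000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "client" + clientId + "-doc" + i);
              batch.add(doc);
              if (batch.size() == 1000) {  // many docs per update request
                server.add(batch);
                batch.clear();
              }
            }
            if (!batch.isEmpty()) {
              server.add(batch);
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    server.shutdown();
  }
}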
>> >> >> > Also, index many documents with each update request. Sending
>> >> >> > one document at a time is very inefficient.
>> >> >> >
>> >> >> > You didn't say how you're doing commits, but those need to be as
>> >> >> > infrequent as you can manage. Ideally, you would use autoCommit
>> >> >> > with openSearcher=false on an interval of about five minutes,
>> >> >> > and send an explicit commit (with the default openSearcher=true)
>> >> >> > after all the indexing is done.
>> >> >> >
>> >> >> > You may have requirements regarding document visibility that
>> >> >> > this won't satisfy, but try to avoid doing commits with
>> >> >> > openSearcher=true (soft commits qualify for this) extremely
>> >> >> > frequently, like once a second. Once a minute is much more
>> >> >> > realistic. Opening a new searcher is an expensive operation,
>> >> >> > especially if you have cache warming configured.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Shawn
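[A final illustrative aside, not part of the original thread: in solrconfig.xml, the commit policy Shawn describes might look roughly like the sketch below. The five-minute hard commit follows his suggestion; the one-minute soft commit is the "once a minute" figure he calls realistic.]

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit every 5 minutes: flushes indexed docs to stable
       storage but, with openSearcher=false, does not open a new
       (expensive) searcher. -->
  <autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit once a minute: controls how quickly new docs become
       visible to searches. Keep this as infrequent as your visibility
       requirements allow. -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>

[Under this setup, a single explicit commit, with the default openSearcher=true, would be issued once the indexing run completes.]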