Yes, I was inadvertently sending them to a replica. When I sent them to the leader, the leader reported "(1000 adds)" and the replica reported only 1 add per document. So it looks like the leader forwards the batched documents to the replicas individually.
On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Internally, the docs are batched up into smaller buckets (10 as I remember) and forwarded to the correct shard leader. I suspect that's what you're seeing.
>
> Erick
>
> On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>
> > Regarding batch indexing: When I send batches of 1000 docs to a standalone Solr server, the log file reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the leader of a replicated index, the leader log file reports much smaller numbers, usually "(12 adds)". Why do the batches appear to be broken up?
> >
> > Peter
> >
> > On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> NP, just making sure.
> >>
> >> I suspect you'll get lots more bang for the buck, and results much more closely matching your expectations, if:
> >>
> >> 1> you batch up a bunch of docs at once rather than sending them one at a time. That's probably the easiest thing to try. Sending docs one at a time is something of an anti-pattern. I usually start with batches of 1,000.
> >>
> >> And just to check: you're not issuing any commits from the client, right? Performance will be terrible if you issue commits after every doc; that's totally an anti-pattern. Doubly so for optimizes. Since you showed us your solrconfig autocommit settings I'm assuming not, but want to be sure.
> >>
> >> 2> use a leader-aware client. I'm totally unfamiliar with Go, so I have no suggestions whatsoever to offer there. But you'll want to batch in this case too.
> >>
> >> On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose <ianr...@fullstory.com> wrote:
> >>
> >> > Hi Erick -
> >> >
> >> > Thanks for the detailed response, and apologies for my confusing terminology. I should have said "WPS" (writes per second) instead of QPS, but I didn't want to introduce a weird new acronym since QPS is well known. Clearly a bad decision on my part. To clarify: I am doing *only* writes (document adds). Whenever I wrote "QPS" I was referring to writes.
> >> >
> >> > It seems clear at this point that I should wrap up the code to do "smart" routing rather than choose Solr nodes randomly, and then see if that changes things. I must admit that although I understand that random node selection will impose a performance hit, theoretically it seems to me that the system should still scale up as you add more nodes (albeit at a lower absolute level of performance than if you used a smart router). Nonetheless, I'm just theorycrafting here, so the better thing to do is just try it experimentally. I hope to have that working today - will report back on my findings.
> >> >
> >> > Cheers,
> >> > - Ian
> >> >
> >> > p.s. To clarify why we are rolling our own smart router code: we use Go over here rather than Java. Although if we still get bad performance with our custom Go router, I may try a pure Java load client using CloudSolrServer to eliminate the possibility of bugs in our implementation.
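[Editor's note: for illustration, a minimal SolrJ sketch of the batching pattern Erick recommends above: 1,000 docs per update request and no commits from the client. The Solr URL, field names, and document counts are placeholders, not values from this thread.]

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and core name; point this at your own Solr instance.
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title_t", "synthetic document " + i);
            batch.add(doc);

            // Send 1,000 docs per update request instead of one at a time.
            if (batch.size() == 1000) {
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }

        // No per-batch commits: rely on autoCommit/autoSoftCommit in solrconfig.xml,
        // then issue a single explicit commit once all indexing is done.
        solr.commit();
        solr.shutdown();
    }
}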
> >> > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >
> >> >> I'm really confused:
> >> >>
> >> >> bq: I am not issuing any queries, only writes (document inserts)
> >> >>
> >> >> bq: It's clear that once the load test client has ~40 simulated users
> >> >>
> >> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right
> >> >>
> >> >> QPS is usually used to mean "Queries Per Second", which is different from the statement that "I am not issuing any queries...". And what does the number of users have to do with inserting documents?
> >> >>
> >> >> You also state: "In many cases, CPU on the solr servers is quite low as well"
> >> >>
> >> >> So let's talk about indexing first. Indexing should scale nearly linearly as long as
> >> >> 1> you are routing your docs to the correct leader, which happens automatically with SolrJ and CloudSolrServer. Rather than rolling your own, I strongly suggest you try this out.
> >> >> 2> you have enough clients feeding the cluster to push CPU utilization on them all. Very often "slow indexing", or in your case "lack of scaling", is a result of document acquisition; your doc generator may be spending all its time waiting for the individual documents to get to Solr and come back.
> >> >>
> >> >> bq: "chooses a random solr server for each ADD request (with 1 doc per add request)"
> >> >>
> >> >> Probably your culprit right there. Each and every document requires a trip across the network (and forwarding to the correct leader). So given that you're not seeing high CPU utilization, I suspect that you're not sending enough docs to SolrCloud fast enough to see scaling. You need to batch up multiple docs; I generally send 1,000 docs at a time.
> >> >>
> >> >> But even if you do solve this, the inter-node routing will prevent linear scaling. When a doc (or a batch of docs) goes to a random Solr node, here's what happens:
> >> >> 1> the docs are re-packaged into groups based on which shard they're destined for
> >> >> 2> the sub-packets are forwarded to the leader for each shard
> >> >> 3> the responses are gathered back and returned to the client.
> >> >>
> >> >> This set of operations will eventually degrade the scaling.
> >> >>
> >> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea behind sharding.
> >> >>
> >> >> If we're talking search requests, the answer is no. Sharding is what you do when your collection no longer fits on a single node. If it _does_ fit on a single node, then you'll usually get better query performance by adding a bunch of replicas to a single shard. When the number of docs on each shard grows large enough that you no longer get good query performance, _then_ you shard. And take the query hit.
> >> >>
> >> >> If we're talking about inserts, then see above. I suspect your problem is that you're _not_ "saturating the SolrCloud cluster"; you're sending docs to Solr very inefficiently and waiting on I/O.
> >> >> Batching docs and sending them to the right leader should scale pretty linearly until you start saturating your network.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose <ianr...@fullstory.com> wrote:
> >> >>
> >> >> > Thanks for the suggestions so far, all.
> >> >> >
> >> >> > 1) We are not using SolrJ on the client (not using Java at all) but I am working on writing a "smart" router so that we can always send to the correct node. I am certainly curious to see how that changes things. Nonetheless, even with the overhead of extra routing hops, the observed behavior (no increase in performance with more nodes) doesn't make any sense to me.
> >> >> >
> >> >> > 2) Commits: we are using autoCommit with openSearcher=false (maxTime=60000) and autoSoftCommit (maxTime=15000).
> >> >> >
> >> >> > 3) Suggestions to batch documents certainly make sense for production code, but in this case I am not really concerned with absolute performance; I just want to see the *relative* performance as we use more Solr nodes. So I don't think batching or not really matters.
> >> >> >
> >> >> > 4) "No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them."
> >> >> >
> >> >> > Can you elaborate on this some? I'm worried I might be misunderstanding something fundamental. A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea behind sharding. Regarding your comment of "increase the number of processes or threads", note that for each value of K (number of Solr nodes) I measured with several different numbers of simulated users so that I could find a "saturation point". For example, take a look at my data for K=2:
> >> >> >
> >> >> > Num Nodes   Num Users   QPS
> >> >> > 2           1            472
> >> >> > 2           5           1790
> >> >> > 2           10          2290
> >> >> > 2           15          2850
> >> >> > 2           20          2900
> >> >> > 2           40          3210
> >> >> > 2           60          3200
> >> >> > 2           80          3210
> >> >> > 2           100         3180
> >> >> >
> >> >> > It's clear that once the load test client has ~40 simulated users, the Solr cluster is saturated. Creating more users just increases the average request latency, such that the total QPS remained (nearly) constant. So I feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps. The problem is that I am finding roughly this same "max point", no matter how many simulated users the load test client creates, for any value of K (> 1).
> >> >> >
> >> >> > Cheers,
> >> >> > - Ian
> >> >> >
> >> >> > On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >> >
> >> >> >> Your indexing client, if written in SolrJ, should use CloudSolrServer, which is, in Matt's terms, "leader aware". It divides up the documents to be indexed into packets where each doc in the packet belongs on the same shard, and then sends the packet to the shard leader. This avoids a lot of re-routing and should scale essentially linearly.
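[Editor's note: a rough SolrJ sketch of that pattern, combined with the "several indexing threads" advice quoted in point 4 above. The ZooKeeper addresses, collection name, thread count, and document counts are placeholders; it assumes a single client instance shared across threads (give each thread its own client if in doubt).]

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudLoadClient {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name. Connecting via ZK
        // lets the client track cluster state and route batches to shard leaders.
        final CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1");

        // Several indexing threads, per the "try five of them" advice.
        ExecutorService pool = Executors.newFixedThreadPool(5);
        for (int t = 0; t < 5; t++) {
            final int threadId = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
                        for (int i = 0; i < 20000; i++) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", "thread" + threadId + "-doc" + i);
                            batch.add(doc);
                            if (batch.size() == 1000) {
                                // The client splits this batch into per-shard packets
                                // and sends each packet to that shard's leader.
                                solr.add(batch);
                                batch.clear();
                            }
                        }
                        if (!batch.isEmpty()) {
                            solr.add(batch);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        solr.commit();   // one explicit commit after all indexing is done
        solr.shutdown();
    }
}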
> >> >> >> You may have to add more clients, though, depending upon how hard the document-generator is working.
> >> >> >>
> >> >> >> Also, make sure that you send batches of documents as Shawn suggests; I use 1,000 as a starting point.
> >> >> >>
> >> >> >> Best,
> >> >> >> Erick
> >> >> >>
> >> >> >> On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> >> >> >>
> >> >> >> > On 10/30/2014 2:56 PM, Ian Rose wrote:
> >> >> >> >> I think this is true only for actual queries, right? I am not issuing any queries, only writes (document inserts). In the case of writes, increasing the number of shards should increase my throughput (in ops/sec) more or less linearly, right?
> >> >> >> >
> >> >> >> > No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them. Also, index many documents with each update request. Sending one document at a time is very inefficient.
> >> >> >> >
> >> >> >> > You didn't say how you're doing commits, but those need to be as infrequent as you can manage. Ideally, you would use autoCommit with openSearcher=false on an interval of about five minutes, and send an explicit commit (with the default openSearcher=true) after all the indexing is done.
> >> >> >> >
> >> >> >> > You may have requirements regarding document visibility that this won't satisfy, but try to avoid doing commits with openSearcher=true (soft commits qualify for this) extremely frequently, like once a second. Once a minute is much more realistic. Opening a new searcher is an expensive operation, especially if you have cache warming configured.
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Shawn
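[Editor's note: for reference, the commit settings discussed in this thread live in the <updateHandler> section of solrconfig.xml. A sketch matching the values Ian mentions (hard commit every 60 seconds without opening a searcher, soft commit every 15 seconds); Shawn's suggestion would raise the hard-commit interval to roughly five minutes.]

<!-- Hard commit: flush to durable storage regularly, but don't open a new searcher. -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: controls document visibility; keep it as infrequent as you can tolerate. -->
<autoSoftCommit>
  <maxTime>15000</maxTime>
</autoSoftCommit>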