Internally, the docs are batched up into smaller buckets (10 docs each, as I remember) and forwarded to the correct shard leader. I suspect that's what you're seeing.
Erick

On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
> Regarding batch indexing:
> When I send batches of 1000 docs to a standalone Solr server, the log file
> reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the
> leader of a replicated index, the leader log file reports much smaller
> numbers, usually "(12 adds)". Why do the batches appear to be broken up?
>
> Peter
>
> On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> NP, just making sure.
>>
>> I suspect you'll get lots more bang for the buck, and results much more
>> closely matching your expectations, if:
>>
>> 1> you batch up a bunch of docs at once rather than sending them one at
>> a time. That's probably the easiest thing to try. Sending docs one at a
>> time is something of an anti-pattern. I usually start with batches of
>> 1,000.
>>
>> And just to check... you're not issuing any commits from the client,
>> right? Performance will be terrible if you issue commits after every
>> doc; that's totally an anti-pattern. Doubly so for optimizes. Since you
>> showed us your solrconfig autocommit settings I'm assuming not, but I
>> want to be sure.
>>
>> 2> use a leader-aware client. I'm totally unfamiliar with Go, so I have
>> no suggestions whatsoever to offer there, but you'll want to batch in
>> this case too.
>>
>> On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose <ianr...@fullstory.com> wrote:
>> > Hi Erick -
>> >
>> > Thanks for the detailed response, and apologies for my confusing
>> > terminology. I should have said "WPS" (writes per second) instead of
>> > QPS, but I didn't want to introduce a weird new acronym since QPS is
>> > well known. Clearly a bad decision on my part. To clarify: I am doing
>> > *only* writes (document adds). Whenever I wrote "QPS" I was referring
>> > to writes.
>> >
>> > It seems clear at this point that I should wrap up the code to do
>> > "smart" routing rather than choosing Solr nodes randomly, and then see
>> > if that changes things. I must admit that although I understand that
>> > random node selection will impose a performance hit, theoretically it
>> > seems to me that the system should still scale up as you add more
>> > nodes (albeit at a lower absolute level of performance than with a
>> > smart router). Nonetheless, I'm just theorycrafting here, so the
>> > better thing to do is just try it experimentally. I hope to have that
>> > working today - will report back on my findings.
>> >
>> > Cheers,
>> > - Ian
>> >
>> > p.s. To clarify why we are rolling our own smart router code: we use
>> > Go over here rather than Java. Although if we still get bad
>> > performance with our custom Go router, I may try a pure Java load
>> > client using CloudSolrServer to eliminate the possibility of bugs in
>> > our implementation.
>> >
>> > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson
>> > <erickerick...@gmail.com> wrote:
>> >
>> >> I'm really confused:
>> >>
>> >> bq: I am not issuing any queries, only writes (document inserts)
>> >>
>> >> bq: It's clear that once the load test client has ~40 simulated users
>> >>
>> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support
>> >> a higher QPS than 2 shards over 2 Solr nodes, right
>> >>
>> >> QPS is usually used to mean "Queries Per Second", which is hard to
>> >> square with the statement "I am not issuing any queries...". And what
>> >> do the number of users have to do with inserting documents?
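[Illustrative aside, not part of the original thread: the batching and leader-aware routing suggested above might look like the following sketch in SolrJ, assuming the Solr 4.x CloudSolrServer API. The ZooKeeper addresses, collection name, field names, and document counts are placeholders.]

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // CloudSolrServer watches cluster state in ZooKeeper and routes each
    // document to its shard leader, avoiding the extra network hop that
    // random node selection incurs.
    CloudSolrServer server =
        new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");

    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < 100000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("title_t", "document number " + i);
      batch.add(doc);
      if (batch.size() == 1000) {   // send 1,000 docs per add request
        server.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);            // flush the remainder
    }
    // Note: no commit issued from the client; visibility is left to the
    // autoCommit / autoSoftCommit settings discussed later in the thread.
    server.shutdown();
  }
}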
>> >>
>> >> You also state: "In many cases, CPU on the solr servers is quite low
>> >> as well"
>> >>
>> >> So let's talk about indexing first. Indexing should scale nearly
>> >> linearly as long as:
>> >> 1> you are routing your docs to the correct leader, which happens
>> >> automatically with SolrJ and CloudSolrServer. Rather than rolling
>> >> your own, I strongly suggest you try this out.
>> >> 2> you have enough clients feeding the cluster to push CPU
>> >> utilization on all the nodes. Very often "slow indexing", or in your
>> >> case "lack of scaling", is really a document-acquisition problem:
>> >> your doc generator is spending all its time waiting for individual
>> >> documents to get to Solr and come back.
>> >>
>> >> bq: "chooses a random solr server for each ADD request (with 1 doc
>> >> per add request)"
>> >>
>> >> Probably your culprit right there. Each and every document has to
>> >> cross the network (and then be forwarded to the correct leader). So,
>> >> given that you're not seeing high CPU utilization, I suspect that
>> >> you're not sending docs to SolrCloud fast enough to see scaling. You
>> >> need to batch up multiple docs; I generally send 1,000 docs at a time.
>> >>
>> >> But even if you do solve this, the inter-node routing will prevent
>> >> linear scaling. When a doc (or a batch of docs) goes to a random Solr
>> >> node, here's what happens:
>> >> 1> the docs are re-packaged into groups based on which shard they're
>> >> destined for
>> >> 2> the sub-packets are forwarded to the leader for each shard
>> >> 3> the responses are gathered back and returned to the client
>> >>
>> >> This set of operations will eventually degrade the scaling.
>> >>
>> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support
>> >> a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole
>> >> idea behind sharding.
>> >>
>> >> If we're talking search requests, the answer is no. Sharding is what
>> >> you do when your collection no longer fits on a single node. If it
>> >> _does_ fit on a single node, then you'll usually get better query
>> >> performance by adding a bunch of replicas to a single shard. When the
>> >> number of docs on each shard grows large enough that you no longer
>> >> get good query performance, _then_ you shard, and take the query hit.
>> >>
>> >> If we're talking about inserts, then see above. I suspect your
>> >> problem is that you're _not_ "saturating the SolrCloud cluster";
>> >> you're sending docs to Solr very inefficiently and waiting on I/O.
>> >> Batching docs and sending them to the right leader should scale
>> >> pretty linearly until you start saturating your network.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose <ianr...@fullstory.com> wrote:
>> >> > Thanks for the suggestions so far, all.
>> >> >
>> >> > 1) We are not using SolrJ on the client (we're not using Java at
>> >> > all), but I am working on writing a "smart" router so that we can
>> >> > always send to the correct node. I am certainly curious to see how
>> >> > that changes things. Nonetheless, even with the overhead of extra
>> >> > routing hops, the observed behavior (no increase in performance
>> >> > with more nodes) doesn't make any sense to me.
>> >> >
>> >> > 2) Commits: we are using autoCommit with openSearcher=false
>> >> > (maxTime=60000) and autoSoftCommit (maxTime=15000).
>> >> >
>> >> > 3) Suggestions to batch documents certainly make sense for
>> >> > production code, but in this case I am not really concerned with
>> >> > absolute performance; I just want to see the *relative* performance
>> >> > as we use more Solr nodes. So I don't think batching or not really
>> >> > matters.
>> >> >
>> >> > 4) "No, that won't affect indexing speed all that much. The way to
>> >> > increase indexing speed is to increase the number of processes or
>> >> > threads that are indexing at the same time. Instead of having one
>> >> > client sending update requests, try five of them."
>> >> >
>> >> > Can you elaborate on this some? I'm worried I might be
>> >> > misunderstanding something fundamental. A cluster of 3 shards over
>> >> > 3 Solr nodes *should* support a higher QPS than 2 shards over 2
>> >> > Solr nodes, right? That's the whole idea behind sharding. Regarding
>> >> > your comment to "increase the number of processes or threads": note
>> >> > that for each value of K (number of Solr nodes) I measured with
>> >> > several different numbers of simulated users so that I could find a
>> >> > "saturation point". For example, take a look at my data for K=2:
>> >> >
>> >> > Num Nodes | Num Users | QPS
>> >> > 2         | 1         | 472
>> >> > 2         | 5         | 1790
>> >> > 2         | 10        | 2290
>> >> > 2         | 15        | 2850
>> >> > 2         | 20        | 2900
>> >> > 2         | 40        | 3210
>> >> > 2         | 60        | 3200
>> >> > 2         | 80        | 3210
>> >> > 2         | 100       | 3180
>> >> >
>> >> > It's clear that once the load test client has ~40 simulated users,
>> >> > the Solr cluster is saturated. Creating more users just increases
>> >> > the average request latency, such that the total QPS remains
>> >> > (nearly) constant. So I feel pretty confident that a cluster of
>> >> > size 2 *maxes out* at ~3200 QPS. The problem is that I am finding
>> >> > roughly this same "max point" for every value of K (> 1), no matter
>> >> > how many simulated users the load test client creates.
>> >> >
>> >> > Cheers,
>> >> > - Ian
>> >> >
>> >> > On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson
>> >> > <erickerick...@gmail.com> wrote:
>> >> >
>> >> >> Your indexing client, if written in SolrJ, should use
>> >> >> CloudSolrServer, which is, in Matt's terms, "leader aware". It
>> >> >> divides the documents to be indexed into packets where each doc
>> >> >> in the packet belongs on the same shard, and then sends each
>> >> >> packet to the shard leader. This avoids a lot of re-routing and
>> >> >> should scale essentially linearly. You may have to add more
>> >> >> clients, though, depending upon how hard the document generator
>> >> >> is working.
>> >> >>
>> >> >> Also, make sure that you send batches of documents, as Shawn
>> >> >> suggests; I use 1,000 as a starting point.
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>> >> >> > On 10/30/2014 2:56 PM, Ian Rose wrote:
>> >> >> >> I think this is true only for actual queries, right? I am not
>> >> >> >> issuing any queries, only writes (document inserts). In the
>> >> >> >> case of writes, increasing the number of shards should increase
>> >> >> >> my throughput (in ops/sec) more or less linearly, right?
>> >> >> >
>> >> >> > No, that won't affect indexing speed all that much. The way to
>> >> >> > increase indexing speed is to increase the number of processes
>> >> >> > or threads that are indexing at the same time. Instead of having
>> >> >> > one client sending update requests, try five of them.
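[Another illustrative aside, not part of the original thread: one way to run Shawn's "five of them" from a single JVM is a small thread pool, sketched below against the Solr 4.x SolrJ API. The thread count, document counts, and id scheme are placeholders; a single CloudSolrServer instance is shared across threads, which SolrJ allows.]

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
  public static void main(String[] args) throws Exception {
    // One CloudSolrServer shared by all worker threads; it is thread-safe.
    final CloudSolrServer server =
        new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
    server.setDefaultCollection("collection1");

    // Five concurrent "clients", each sending batched update requests.
    ExecutorService pool = Executors.newFixedThreadPool(5);
    for (int t = 0; t < 5; t++) {
      final int clientId = t;
      pool.submit(new Runnable() {
        public void run() {
          try {
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 20000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "client" + clientId + "-doc" + i);
              batch.add(doc);
              if (batch.size() == 1000) {  // many docs per update request
                server.add(batch);
                batch.clear();
              }
            }
            if (!batch.isEmpty()) {
              server.add(batch);
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    server.shutdown();
  }
}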
>> >> >> > Also, index many documents with each update request. Sending
>> >> >> > one document at a time is very inefficient.
>> >> >> >
>> >> >> > You didn't say how you're doing commits, but those need to be as
>> >> >> > infrequent as you can manage. Ideally, you would use autoCommit
>> >> >> > with openSearcher=false on an interval of about five minutes,
>> >> >> > and send an explicit commit (with the default openSearcher=true)
>> >> >> > after all the indexing is done.
>> >> >> >
>> >> >> > You may have requirements regarding document visibility that
>> >> >> > this won't satisfy, but try to avoid doing commits with
>> >> >> > openSearcher=true (soft commits qualify for this) extremely
>> >> >> > frequently, like once a second. Once a minute is much more
>> >> >> > realistic. Opening a new searcher is an expensive operation,
>> >> >> > especially if you have cache warming configured.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Shawn
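[A final illustrative aside, not part of the original thread: in solrconfig.xml, the commit policy Shawn describes might look roughly like the sketch below. The five-minute hard commit follows his suggestion; the one-minute soft commit is the "once a minute" figure he calls realistic.]

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit every 5 minutes: flushes indexed docs to stable
       storage but, with openSearcher=false, does not open a new
       (expensive) searcher. -->
  <autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit once a minute: controls how quickly new docs become
       visible to searches. Keep this as infrequent as your visibility
       requirements allow. -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>

[Under this setup, a single explicit commit, with the default openSearcher=true, would be issued once the indexing run completes.]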