Hi Shalin,

Yes, we are, as a matter of fact! We're preferring local replicas. Given the description of the bug, though, is it possible that's forcing some other behaviour where, given equal shards, it will always route to the same shard? Not 100% sure I understand it. That said, thank you; we'll try with Solr 8.6 and I'll report back.
Cheers,
Jonathan

On Sat, Oct 24, 2020 at 11:37 PM Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> Hi Jonathan,
>
> Are you using the "shards.preference" parameter by any chance? There is a
> bug that causes uneven request distribution during fan-out. Can you check
> the number of requests using the /admin/metrics API? Look for the /select
> handler's distrib and local request times for each core in the node, and
> compare those across different nodes.
>
> The bug I refer to is https://issues.apache.org/jira/browse/SOLR-14471 and
> it is fixed in Solr 8.5.2.
>
> On Fri, Oct 23, 2020 at 9:05 AM Jonathan Tan <jty....@gmail.com> wrote:
>
> > Hi,
> >
> > We've got a 3-node SolrCloud cluster running on GKE, each node on its
> > own kube node (which is itself relatively empty of other things).
> >
> > Our collection has ~18M documents, 36GB in size, split into 6 shards
> > with 2 replicas each, evenly distributed across the 3 nodes. Our JVMs
> > are currently sized to ~14GB min & max, and they are running on SSDs.
> >
> > [image: Screen Shot 2020-10-23 at 2.15.48 pm.png]
> >
> > Graph also available here: https://pasteboard.co/JwUQ98M.png
> >
> > Under perf testing of ~30 requests per second, we start seeing really
> > bad response times (around 3s at the 90th percentile), and *one* of the
> > nodes would be fully maxed out on CPU. At about 15 requests per second,
> > our response times are reasonable enough for our purposes (~0.8-1.1s),
> > but as is visible in the graph, it's definitely *not* an even
> > distribution of the CPU load: one of the nodes is running at around 13
> > cores, whilst the other 2 are running at ~8 cores and ~6 cores
> > respectively.
> >
> > We've verified in our monitoring tools that the 3 nodes *are* getting
> > an even distribution of requests, and we're using a Kube service, which
> > is in itself a fairly well-known tool for load balancing pods.
> > We've also used kube services extensively for load balancing other
> > apps and haven't seen such a problem, so we doubt the load balancer is
> > the problem.
> >
> > All 3 nodes are built from the same Kubernetes StatefulSet deployment,
> > so they all have the same configuration & setup. Additionally, over the
> > course of the day, it may suddenly change so that an entirely different
> > node is the one that is majorly overloaded on CPU.
> >
> > All this happens only under queries; we are doing no indexing at that
> > time.
> >
> > We'd initially thought it might be the Overseer being majorly
> > overloaded under queries (although we were surprised), until we did
> > more testing and found that even nodes that weren't the Overseer would
> > sometimes show that disparity. We also tried using the `ADDROLE` API to
> > force an Overseer change in the middle of a test, and whilst the tree
> > updated to show that the Overseer had changed, it made no difference to
> > the node with the highest CPU load.
> >
> > Directing queries directly to the non-busy nodes does actually give us
> > decent response times.
> >
> > We're quite puzzled by this and would really like some help figuring
> > out *why* the CPU on one node is so much higher. I did try to get
> > Jaeger tracing working (we already have Jaeger in our cluster), but we
> > just kept getting errors on startup with Solr not being able to load
> > the main function...
> >
> > Thank you in advance!
> > Cheers
> > Jonathan
>
> --
> Regards,
> Shalin Shekhar Mangar.
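To make Shalin's metrics suggestion concrete: below is a rough sketch of comparing distrib vs. local `/select` request counts across cores from an `/admin/metrics` response. The sample payload, the collection/core names, and the exact metric key names (`QUERY./select.distrib.requestTimes` etc.) are illustrative assumptions, not captured from a real cluster; check your own `/admin/metrics` output on your Solr version for the real key names.

```python
import json

# Assumed /admin/metrics response shape with made-up core names and counts.
# The key names below are a guess at Solr's per-core /select handler metrics;
# inspect your own /admin/metrics output for the exact names.
sample = json.loads("""
{
  "metrics": {
    "solr.core.products.shard1.replica_n1": {
      "QUERY./select.distrib.requestTimes": {"count": 1200},
      "QUERY./select.local.requestTimes": {"count": 4100}
    },
    "solr.core.products.shard2.replica_n3": {
      "QUERY./select.distrib.requestTimes": {"count": 1150},
      "QUERY./select.local.requestTimes": {"count": 950}
    }
  }
}
""")

def request_counts(metrics: dict) -> dict:
    """Return {core: (distrib_count, local_count)} for the /select handler."""
    out = {}
    for core, m in metrics["metrics"].items():
        distrib = m.get("QUERY./select.distrib.requestTimes", {}).get("count", 0)
        local = m.get("QUERY./select.local.requestTimes", {}).get("count", 0)
        out[core] = (distrib, local)
    return out

counts = request_counts(sample)
for core, (distrib, local) in sorted(counts.items()):
    print(f"{core}: distrib={distrib} local={local}")
```

If fan-out (distrib) counts are roughly even across nodes but one node's cores show far higher local counts than their peers, replicas on that node are being favoured during fan-out, which is the uneven-distribution symptom SOLR-14471 describes.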
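For reference, here is a minimal sketch of the kind of query the thread is discussing, with the `shards.preference` parameter set to prefer local replicas. The hostname and collection name are made up; `shards.preference=replica.location:local` is the actual Solr parameter under discussion.

```python
from urllib.parse import urlencode

# Hypothetical host and collection; only the parameter names are real.
base = "http://solr-0.solr-headless:8983/solr/products/select"

params = {
    "q": "*:*",
    # Prefer replicas hosted on the node that received the request.
    # Per SOLR-14471 (fixed in 8.5.2), ties between equally-preferred
    # replicas were not shuffled, so fan-out requests could repeatedly
    # land on the same replica and overload one node.
    "shards.preference": "replica.location:local",
}

url = base + "?" + urlencode(params)
print(url)
```

On an affected version, removing the parameter (or upgrading) is the way to test whether it is responsible for the skew.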