Hi Jonathan,

Are you using the "shards.preference" parameter by any chance? There is a
bug that causes uneven request distribution during fan-out. Can you check
the number of requests using the /admin/metrics API? Look for the /select
handler's distrib and local request times for each core on the node.
Compare those across different nodes.
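
In case it's useful, here's a rough Python sketch (untested; the host names
are placeholders) that pulls the /select timers from each node's
/admin/metrics endpoint and prints the counts so they're easy to compare:

import json
from urllib.request import urlopen

# Placeholder node addresses -- replace with your actual Solr hosts.
NODES = ["http://solr-0:8983", "http://solr-1:8983", "http://solr-2:8983"]

for node in NODES:
    # Only fetch core-level metrics for the /select handler.
    url = node + "/solr/admin/metrics?group=core&prefix=QUERY./select&compact=true"
    with urlopen(url) as resp:
        metrics = json.load(resp)["metrics"]
    for core, values in metrics.items():
        for name, value in values.items():
            # Timers come back as maps with a request count and percentile timings.
            if name.endswith("requestTimes") and isinstance(value, dict):
                print(f"{node} {core} {name}: count={value.get('count')} "
                      f"p95_ms={value.get('p95_ms')}")

If one node's counts turn out much higher than the others', that would line
up with the uneven fan-out from the bug below.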

The bug I'm referring to is https://issues.apache.org/jira/browse/SOLR-14471
and it is fixed in Solr 8.5.2.
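
For reference, shards.preference is usually passed as a request parameter (or
set as a default in solrconfig.xml), e.g. something like this, where the host
and collection name are just placeholders:

http://solr-0:8983/solr/mycollection/select?q=*:*&shards.preference=replica.location:local

If that parameter shows up in your requests or configs, the bug above is
worth ruling out first.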

On Fri, Oct 23, 2020 at 9:05 AM Jonathan Tan <jty....@gmail.com> wrote:

> Hi,
>
> We've got a 3-node SolrCloud cluster running on GKE, each on its own
> kube node (which is itself relatively empty of other things).
>
> Our collection has ~18M documents, about 36GB in size, split into 6 shards
> with 2 replicas each, and they are evenly distributed across the 3 nodes.
> Our JVMs are currently sized to ~14GB min & max, and they are running on
> SSDs.
>
>
> [image: Screen Shot 2020-10-23 at 2.15.48 pm.png]
>
> Graph also available here: https://pasteboard.co/JwUQ98M.png
>
> Under perf testing of ~30 requests per second, we start seeing really bad
> response times (around 3s at the 90th percentile), and *one* of the nodes
> would be fully maxed out on CPU. At about 15 requests per second, our
> response times are reasonable enough for our purposes (~0.8-1.1s), but as
> is visible in the graph, it's definitely *not* an even distribution of the
> CPU load. One of the nodes is running at around 13 cores, whilst the other
> 2 are running at ~8 cores and 6 cores respectively.
>
> We've tracked in our monitoring tools that the 3 nodes *are* getting an
> even distribution of requests, and we're using a Kube service, which is
> itself a fairly well-known tool for load balancing pods. We've also used
> kube services heaps for load balancing other apps and haven't seen such a
> problem, so we doubt the load balancer is the problem.
>
> All 3 nodes are built from the same Kubernetes StatefulSet deployment, so
> they all have the same configuration & setup. Additionally, over the
> course of the day, it may suddenly change so that an entirely different
> node is the one that is majorly overloaded on CPU.
>
> All this is happening only under queries, and we are doing no indexing at
> that time.
>
> We'd initially thought it might be the overseer that was being majorly
> overloaded under queries (although we were surprised), until we did more
> testing and found that even nodes that weren't the overseer would
> sometimes show that disparity. We'd also tried using the `ADDROLE` API to
> force an overseer change in the middle of a test, and whilst the tree
> updated to show that the overseer had changed, it made no difference to
> the highest CPU load.
>
> Directing queries directly to the non-busy nodes does actually give us
> decent response times.
>
> We're quite puzzled by this and would really like some help figuring out
> *why* the CPU on one node is so much higher. I did try to get Jaeger
> tracing working (we already have Jaeger in our cluster), but we just kept
> getting errors on startup with Solr not being able to load the main
> function...
>
>
> Thank you in advance!
> Cheers
> Jonathan
>
>
>
>

-- 
Regards,
Shalin Shekhar Mangar.
