Good to hear that. Thanks for closing the loop!

On Tue, Oct 27, 2020 at 11:14 AM Jonathan Tan <jty....@gmail.com> wrote:

> Hi Shalin,
>
> Moving to 8.6.3 fixed it!
>
> Thank you very much for that. :)
> We'd considered an upgrade - just because - but we wouldn't have done so as
> quickly without your information.
>
> Cheers
>
> On Sat, Oct 24, 2020 at 11:37 PM Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
>
> > Hi Jonathan,
> >
> > Are you using the "shards.preference" parameter by any chance? There is a
> > bug that causes uneven request distribution during fan-out. Can you check
> > the number of requests using the /admin/metrics API? Look for the /select
> > handler's distrib and local request times for each core in the node.
> > Compare those across different nodes.
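> >
> > Something along these lines will pull those numbers from every node for a
> > side-by-side comparison (just a rough sketch - the node URLs are
> > placeholders and the exact metric key names vary between Solr versions, so
> > adjust the prefix filter to whatever your nodes actually report):
> >
> > # Rough sketch: compare /select handler metrics across nodes via /admin/metrics.
> > # Node URLs and the metric prefix are assumptions - adapt them to your cluster.
> > import json
> > import urllib.request
> >
> > NODES = [
> >     "http://solr-0:8983",
> >     "http://solr-1:8983",
> >     "http://solr-2:8983",
> > ]
> >
> > for node in NODES:
> >     url = node + "/solr/admin/metrics?group=core&prefix=QUERY./select&wt=json"
> >     with urllib.request.urlopen(url) as resp:
> >         data = json.load(resp)
> >     print("==", node)
> >     for registry, metrics in data.get("metrics", {}).items():
> >         for key, value in metrics.items():
> >             # Timers expose "count", "mean_ms", etc.; counters are plain numbers.
> >             count = value.get("count", value) if isinstance(value, dict) else value
> >             print("  %s %s: %s" % (registry, key, count))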
> >
> > The bug I refer to is https://issues.apache.org/jira/browse/SOLR-14471 and
> > it is fixed in Solr 8.5.2.
> >
> > On Fri, Oct 23, 2020 at 9:05 AM Jonathan Tan <jty....@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > We've got a 3-node SolrCloud cluster running on GKE, each Solr node on
> > > its own kube node (which is itself relatively empty of other workloads).
> > >
> > > Our collection has ~18m documents totalling 36gb, split into 6 shards
> > > with 2 replicas each, and the replicas are evenly distributed across the
> > > 3 nodes. Our JVMs are currently sized to ~14gb min & max, and they are
> > > running on SSDs.
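> > >
> > > For anyone wanting to double-check that sort of layout, a quick sketch
> > > along these lines (the Solr URL and collection name are placeholders)
> > > dumps CLUSTERSTATUS and counts replicas and leaders per node:
> > >
> > > # Rough sketch: list replica (and leader) placement per node via CLUSTERSTATUS.
> > > # SOLR_URL and COLLECTION are placeholders - substitute your own values.
> > > import json
> > > import urllib.request
> > > from collections import Counter
> > >
> > > SOLR_URL = "http://solr-0:8983/solr"
> > > COLLECTION = "our-collection"
> > >
> > > url = (SOLR_URL + "/admin/collections?action=CLUSTERSTATUS"
> > >        "&collection=" + COLLECTION + "&wt=json")
> > > with urllib.request.urlopen(url) as resp:
> > >     status = json.load(resp)
> > >
> > > per_node = Counter()
> > > shards = status["cluster"]["collections"][COLLECTION]["shards"]
> > > for shard_name, shard in shards.items():
> > >     for replica_name, replica in shard["replicas"].items():
> > >         per_node[replica["node_name"]] += 1
> > >         leader = " (leader)" if replica.get("leader") == "true" else ""
> > >         print(shard_name, replica_name, "->", replica["node_name"] + leader)
> > >
> > > print("replicas per node:", dict(per_node))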
> > >
> > >
> > > [image: Screen Shot 2020-10-23 at 2.15.48 pm.png]
> > >
> > > Graph also available here: https://pasteboard.co/JwUQ98M.png
> > >
> > > Under perf testing of ~30 requests per second, we start seeing really
> > > bad response times (around 3s at the 90th percentile), and *one* of the
> > > nodes would be fully maxed out on CPU. At about 15 requests per second,
> > > our response times are reasonable enough for our purposes (~0.8-1.1s),
> > > but as is visible in the graph, it's definitely *not* an even
> > > distribution of the CPU load. One of the nodes is running at around 13
> > > cores, whilst the other 2 are running at ~8 cores and ~6 cores
> > > respectively.
> > >
> > > We've verified in our monitoring tools that the 3 nodes *are* getting an
> > > even distribution of requests, and we're using a Kube service, which is
> > > itself a fairly well known way of load balancing pods. We've also used
> > > kube services extensively for load balancing other apps and haven't seen
> > > such a problem, so we doubt the load balancer is the issue.
> > >
> > > All 3 nodes are built from the same kubernetes statefulset, so they all
> > > have the same configuration & setup. Additionally, over the course of
> > > the day, it may suddenly change so that an entirely different node is
> > > the one that is heavily overloaded on CPU.
> > >
> > > All this is happening only under queries, and we are doing no indexing
> > > at that time.
> > >
> > > We'd initially thought it might be the overseer that was being heavily
> > > overloaded under queries (although that surprised us), until we did more
> > > testing and found that even nodes that weren't the overseer would
> > > sometimes show that disparity. We'd also tried using the `ADDROLE` API
> > > to force an overseer change in the middle of a test, and whilst the tree
> > > updated to show that the overseer had changed, it made no difference to
> > > the highest CPU load.
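> > >
> > > For completeness, the overseer change was done with an ADDROLE call
> > > roughly like this sketch (the Solr URL and target node name are
> > > placeholders):
> > >
> > > # Rough sketch: move the overseer role onto a specific node via ADDROLE.
> > > # The node name is a placeholder - use the live_nodes entry for the target node.
> > > import urllib.request
> > >
> > > SOLR_URL = "http://solr-0:8983/solr"
> > > TARGET_NODE = "solr-1:8983_solr"
> > >
> > > url = (SOLR_URL + "/admin/collections?action=ADDROLE"
> > >        "&role=overseer&node=" + TARGET_NODE)
> > > print(urllib.request.urlopen(url).read().decode())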
> > >
> > > Directing queries directly at the non-busy nodes does actually give us
> > > decent response times.
> > >
> > > We're quite puzzled by this and would really like some help figuring out
> > > *why* the CPU on one node is so much higher. I did try to get Jaeger
> > > tracing working (we already have Jaeger in our cluster), but we just
> > > kept getting errors on startup with Solr not being able to load the main
> > > function...
> > >
> > >
> > > Thank you in advance!
> > > Cheers
> > > Jonathan
> > >
> > >
> > >
> > >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
>


-- 
Regards,
Shalin Shekhar Mangar.
