My mistake on the link, which should be this: https://lucene.apache.org/solr/guide/7_1/solrcloud-autoscaling-auto-add-replicas.html#implementation-using-autoaddreplicas-trigger
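Since that page is about the autoAddReplicas trigger, here is a minimal sketch of what the suspend suggested below looks like against the autoscaling write API. This is Python with the requests library; the base URL is a placeholder for one of your own nodes, and the trigger name .auto_add_replicas is the built-in one described on the linked page:

import requests

SOLR_URL = "http://localhost:8983/solr"  # placeholder: point at one of your nodes

def toggle_auto_add_replicas(suspend=True):
    # The autoscaling write API accepts suspend-trigger / resume-trigger
    # commands; .auto_add_replicas is the trigger Solr creates when
    # autoAddReplicas is enabled on a collection.
    command = "suspend-trigger" if suspend else "resume-trigger"
    resp = requests.post(
        SOLR_URL + "/admin/autoscaling",
        json={command: {"name": ".auto_add_replicas"}},
    )
    resp.raise_for_status()
    return resp.json()

# Suspend before maintenance, resume once the cluster is stable again:
# toggle_auto_add_replicas(suspend=True)
# toggle_auto_add_replicas(suspend=False)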
--Jack

On Thu, Sep 5, 2019 at 11:02 AM Jack Schlederer <schleder...@gmail.com>
wrote:

> I'd defer to the committers if they have any further advice, but you
> might have to suspend the autoAddReplicas trigger through the
> autoscaling API (
> https://solr.stage.ecommerce.sandbox.directsupply-sandbox.cloud:8985/solr/ )
> if you set up your collections with autoAddReplicas enabled. Then, the
> system will not try to re-create missing replicas.
>
> Just another note on your setup: it seems to me that using only 3
> nodes for 168 GB worth of indices isn't making the most of SolrCloud,
> which provides the capability to shard indices across a large number
> of nodes. Just a data point to consider when sizing your cluster: my
> org runs only about 50 GB of indices, but we spread it over 35 nodes
> with 8 GB of heap apiece, each collection with 2+ shards.
>
> --Jack
>
> On Thu, Sep 5, 2019 at 8:47 AM Doss <itsmed...@gmail.com> wrote:
>
>> Thanks Erick for the explanation. The sum of all our index sizes is
>> about 138 GB, and only 2 indexes are > 19 GB, so it's time to scale
>> up :-). Adding new hardware will take at least a couple of days;
>> until then, is there any option to control the replication method?
>>
>> Thanks,
>> Doss.
>>
>> On Thu, Sep 5, 2019 at 6:12 PM Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> You say you have three nodes, 130 replicas and a replication factor
>>> of 3, so you have 130 cores/node. At least one of those cores has a
>>> 20G index, right?
>>>
>>> What is the sum of all the indexes on a single physical machine?
>>>
>>> I think your system is under-provisioned, that you've been riding
>>> at the edge of instability for quite some time, and that you have
>>> added enough more docs that you finally reached a tipping point.
>>> But that's largely speculation.
>>>
>>> So adding more heap may help. But Real Soon Now you need to think
>>> about adding more hardware and moving some of your replicas to that
>>> new hardware.
>>>
>>> Again, this is speculation. But when systems are running with an
>>> _aggregate_ index size that is many multiples of the available
>>> memory (total physical memory), it's a red flag. I'm guessing a bit
>>> since I don't know the aggregate for all replicas…
>>>
>>> Best,
>>> Erick
>>>
>>>> On Sep 5, 2019, at 8:08 AM, Doss <itsmed...@gmail.com> wrote:
>>>>
>>>> @Jorn We are adding a few more zookeeper nodes soon. Thanks.
>>>>
>>>> @Erick, sorry, I couldn't understand it clearly. We have 90GB RAM
>>>> per node, out of which 14 GB is assigned to the HEAP. Do you mean
>>>> we have to allocate more HEAP, or that we need to add more
>>>> physical RAM?
>>>>
>>>> This system ran for 8 to 9 months without any major issues; only
>>>> in recent times have we been facing so many such incidents.
>>>>
>>>> On Thu, Sep 5, 2019 at 5:20 PM Erick Erickson
>>>> <erickerick...@gmail.com> wrote:
>>>>
>>>>> If I'm reading this correctly, you have a huge amount of index in
>>>>> not much memory. You only have 14g allocated across 130 replicas,
>>>>> at least one of which has a 20g index. You don't need as much
>>>>> memory as your aggregate index size, but this system feels
>>>>> severely under-provisioned. I suspect that's the root of your
>>>>> instability.
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Thu, Sep 5, 2019, 07:08 Doss <itsmed...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We are using a 3 node SOLR (7.0.1) cloud setup with a 1 node
>>>>>> zookeeper ensemble. Each system has 16 CPUs, 90GB RAM (14GB
>>>>>> HEAP), and 130 cores (3 NRT replicas each) with index sizes
>>>>>> ranging from 700MB to 20GB.
>>>>>>
>>>>>> autoCommit - once every 10 minutes
>>>>>> softCommit - once every 30 seconds
>>>>>>
>>>>>> At peak time, if a shard goes into recovery mode, many other
>>>>>> shards also go into recovery mode within a few minutes, which
>>>>>> creates huge load (200+ load average), and SOLR becomes
>>>>>> non-responsive. To fix this we restart the node; the leader then
>>>>>> tries to correct the index by initiating replication, which
>>>>>> causes load again, and the node goes back into a non-responsive
>>>>>> state.
>>>>>>
>>>>>> As soon as a node starts, the replication process is initiated
>>>>>> for all 130 cores. Is there any way we can control it, like one
>>>>>> after the other?
>>>>>>
>>>>>> Thanks,
>>>>>> Doss.
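P.S. Doss, on your question about controlling recovery one core at a
time: I don't know of a built-in per-core throttle, but one thing that
helps during rolling restarts is waiting until no replica is still
recovering before touching the next node. A rough sketch of one way to
check that, via the Collections API CLUSTERSTATUS action (same
Python/requests assumptions as above; the helper names are just for
illustration):

import time
import requests

SOLR_URL = "http://localhost:8983/solr"  # placeholder: any live node

def non_active_replicas():
    # CLUSTERSTATUS reports every replica's state: active, recovering,
    # down, or recovery_failed.
    resp = requests.get(
        SOLR_URL + "/admin/collections",
        params={"action": "CLUSTERSTATUS", "wt": "json"},
    )
    resp.raise_for_status()
    pending = []
    for coll_name, coll in resp.json()["cluster"]["collections"].items():
        for shard_name, shard in coll["shards"].items():
            for replica_name, replica in shard["replicas"].items():
                if replica["state"] != "active":
                    pending.append((coll_name, shard_name, replica_name))
    return pending

def wait_until_healthy(poll_seconds=30):
    # Block until every replica reports 'active', so recoveries finish
    # before the next node is restarted.
    while non_active_replicas():
        time.sleep(poll_seconds)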