My mistake on the link, which should be this: https://lucene.apache.org/solr/guide/7_1/solrcloud-autoscaling-auto-add-replicas.html#implementation-using-autoaddreplicas-trigger
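Since that page is about the autoAddReplicas trigger, here is a minimal sketch of what the suspend suggested below looks like against the autoscaling write API. This is Python with the requests library; the base URL is a placeholder for one of your own nodes, and the trigger name .auto_add_replicas is the built-in one described on the linked page:

import requests

SOLR_URL = "http://localhost:8983/solr"  # placeholder: point at one of your nodes

def toggle_auto_add_replicas(suspend=True):
    # The autoscaling write API accepts suspend-trigger / resume-trigger
    # commands; .auto_add_replicas is the trigger Solr creates when
    # autoAddReplicas is enabled on a collection.
    command = "suspend-trigger" if suspend else "resume-trigger"
    resp = requests.post(
        SOLR_URL + "/admin/autoscaling",
        json={command: {"name": ".auto_add_replicas"}},
    )
    resp.raise_for_status()
    return resp.json()

# Suspend before maintenance, resume once the cluster is stable again:
# toggle_auto_add_replicas(suspend=True)
# toggle_auto_add_replicas(suspend=False)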
--Jack

On Thu, Sep 5, 2019 at 11:02 AM Jack Schlederer <schleder...@gmail.com>
wrote:

> I'd defer to the committers if they have any further advice, but you
> might have to suspend the autoAddReplicas trigger through the
> autoscaling API (
> https://solr.stage.ecommerce.sandbox.directsupply-sandbox.cloud:8985/solr/ )
> if you set up your collections with autoAddReplicas enabled. Then, the
> system will not try to re-create missing replicas.
>
> Just another note on your setup: it seems to me that using only 3
> nodes for 168 GB worth of indices isn't making the most of SolrCloud,
> which provides the capability to shard indices across a large number
> of nodes. Just a data point to consider when sizing your cluster: my
> org runs only about 50 GB of indices, but we spread it over 35 nodes
> with 8 GB of heap apiece, each collection with 2+ shards.
>
> --Jack
>
> On Thu, Sep 5, 2019 at 8:47 AM Doss <itsmed...@gmail.com> wrote:
>
>> Thanks Erick for the explanation. The sum of all our index sizes is
>> about 138 GB, and only 2 indexes are > 19 GB, so it's time to scale
>> up :-). Adding new hardware will take at least a couple of days;
>> until then, is there any option to control the replication method?
>>
>> Thanks,
>> Doss.
>>
>> On Thu, Sep 5, 2019 at 6:12 PM Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> You say you have three nodes, 130 replicas and a replication factor
>>> of 3, so you have 130 cores/node. At least one of those cores has a
>>> 20G index, right?
>>>
>>> What is the sum of all the indexes on a single physical machine?
>>>
>>> I think your system is under-provisioned, that you've been riding
>>> at the edge of instability for quite some time, and that you have
>>> added enough more docs that you finally reached a tipping point.
>>> But that's largely speculation.
>>>
>>> So adding more heap may help. But Real Soon Now you need to think
>>> about adding more hardware and moving some of your replicas to that
>>> new hardware.
>>>
>>> Again, this is speculation. But when systems are running with an
>>> _aggregate_ index size that is many multiples of the available
>>> memory (total physical memory), it's a red flag. I'm guessing a bit
>>> since I don't know the aggregate for all replicas…
>>>
>>> Best,
>>> Erick
>>>
>>>> On Sep 5, 2019, at 8:08 AM, Doss <itsmed...@gmail.com> wrote:
>>>>
>>>> @Jorn We are adding a few more zookeeper nodes soon. Thanks.
>>>>
>>>> @Erick, sorry, I couldn't understand it clearly. We have 90GB RAM
>>>> per node, out of which 14 GB is assigned to the HEAP. Do you mean
>>>> we have to allocate more HEAP, or that we need to add more
>>>> physical RAM?
>>>>
>>>> This system ran for 8 to 9 months without any major issues; only
>>>> in recent times have we been facing so many such incidents.
>>>>
>>>> On Thu, Sep 5, 2019 at 5:20 PM Erick Erickson
>>>> <erickerick...@gmail.com> wrote:
>>>>
>>>>> If I'm reading this correctly, you have a huge amount of index in
>>>>> not much memory. You only have 14g allocated across 130 replicas,
>>>>> at least one of which has a 20g index. You don't need as much
>>>>> memory as your aggregate index size, but this system feels
>>>>> severely under-provisioned. I suspect that's the root of your
>>>>> instability.
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Thu, Sep 5, 2019, 07:08 Doss <itsmed...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We are using a 3 node SOLR (7.0.1) cloud setup with a 1 node
>>>>>> zookeeper ensemble. Each system has 16 CPUs, 90GB RAM (14GB
>>>>>> HEAP), and 130 cores (3 NRT replicas each) with index sizes
>>>>>> ranging from 700MB to 20GB.
>>>>>>
>>>>>> autoCommit - once every 10 minutes
>>>>>> softCommit - once every 30 seconds
>>>>>>
>>>>>> At peak time, if a shard goes into recovery mode, many other
>>>>>> shards also go into recovery mode within a few minutes, which
>>>>>> creates huge load (200+ load average), and SOLR becomes
>>>>>> non-responsive. To fix this we restart the node; the leader then
>>>>>> tries to correct the index by initiating replication, which
>>>>>> causes load again, and the node goes back into a non-responsive
>>>>>> state.
>>>>>>
>>>>>> As soon as a node starts, the replication process is initiated
>>>>>> for all 130 cores. Is there any way we can control it, like one
>>>>>> after the other?
>>>>>>
>>>>>> Thanks,
>>>>>> Doss.
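P.S. Doss, on your question about controlling recovery one core at a
time: I don't know of a built-in per-core throttle, but one thing that
helps during rolling restarts is waiting until no replica is still
recovering before touching the next node. A rough sketch of one way to
check that, via the Collections API CLUSTERSTATUS action (same
Python/requests assumptions as above; the helper names are just for
illustration):

import time
import requests

SOLR_URL = "http://localhost:8983/solr"  # placeholder: any live node

def non_active_replicas():
    # CLUSTERSTATUS reports every replica's state: active, recovering,
    # down, or recovery_failed.
    resp = requests.get(
        SOLR_URL + "/admin/collections",
        params={"action": "CLUSTERSTATUS", "wt": "json"},
    )
    resp.raise_for_status()
    pending = []
    for coll_name, coll in resp.json()["cluster"]["collections"].items():
        for shard_name, shard in coll["shards"].items():
            for replica_name, replica in shard["replicas"].items():
                if replica["state"] != "active":
                    pending.append((coll_name, shard_name, replica_name))
    return pending

def wait_until_healthy(poll_seconds=30):
    # Block until every replica reports 'active', so recoveries finish
    # before the next node is restarted.
    while non_active_replicas():
        time.sleep(poll_seconds)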