Thanks, Erick, for the explanation. The sum of all our index sizes is about 138 GB, and only 2 indexes are > 19 GB, so it's time to scale up :-). Adding new hardware will take at least a couple of days; until then, is there any option to control the replication method?
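One thing we are considering in the meantime is throttling the transfer rate of replication. If I read the reference guide correctly, the ReplicationHandler supports a maxWriteMBPerSec parameter; an override along these lines in solrconfig.xml might keep full recoveries from saturating the box (a rough, untested sketch; the 16 MB/s value is only a guess and the exact placement of the parameter should be double-checked against the Index Replication page of the guide):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="defaults">
        <!-- assumption: throttle the rate at which index files are streamed
             during replication/recovery; value in megabytes per second -->
        <str name="maxWriteMBPerSec">16</str>
      </lst>
    </requestHandler>

Would that be a sane stop-gap on a setup like ours, or does it just stretch the recovery window out?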
Thanks,
Doss.

On Thu, Sep 5, 2019 at 6:12 PM Erick Erickson <erickerick...@gmail.com> wrote:

> You say you have three nodes, 130 replicas and a replication factor of 3,
> so you have 130 cores/node. At least one of those cores has a 20G index,
> right?
>
> What is the sum of all the indexes on a single physical machine?
>
> I think your system is under-provisioned and that you’ve been riding at
> the edge of instability for quite some time and have added enough more
> docs that you finally reached a tipping point. But that’s largely
> speculation.
>
> So adding more heap may help. But Real Soon Now you need to think about
> adding more hardware and moving some of your replicas to that new
> hardware.
>
> Again, this is speculation. But when systems are running with an
> _aggregate_ index size that is many multiples of the available memory
> (total physical memory) it’s a red flag. I’m guessing a bit since I don’t
> know the aggregate for all replicas…
>
> Best,
> Erick
>
> > On Sep 5, 2019, at 8:08 AM, Doss <itsmed...@gmail.com> wrote:
> >
> > @Jorn We are adding a few more zookeeper nodes soon. Thanks.
> >
> > @Erick, sorry, I couldn't understand it clearly. We have 90GB RAM per
> > node, out of which 14GB is assigned to the heap. Do you mean we have to
> > allocate more heap, or that we need to add more physical RAM?
> >
> > This system ran for 8 to 9 months without any major issues; only in
> > recent times have we been facing so many such incidents.
> >
> > On Thu, Sep 5, 2019 at 5:20 PM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> If I'm reading this correctly, you have a huge amount of index in not
> >> much memory. You only have 14g allocated across 130 replicas, at least
> >> one of which has a 20g index. You don't need as much memory as your
> >> aggregate index size, but this system feels severely under-provisioned.
> >> I suspect that's the root of your instability.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Sep 5, 2019, 07:08 Doss <itsmed...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> We are using a 3-node SOLR (7.0.1) cloud setup with a 1-node zookeeper
> >>> ensemble. Each system has 16 CPUs, 90GB RAM (14GB heap), and 130 cores
> >>> (3 NRT replicas) with index sizes ranging from 700MB to 20GB.
> >>>
> >>> autoCommit - once every 10 minutes
> >>> softCommit - once every 30 seconds
> >>>
> >>> At peak time, if a shard goes into recovery mode, many other shards
> >>> also go into recovery mode within a few minutes, which creates huge
> >>> load (200+ load average) and SOLR becomes unresponsive. To fix this we
> >>> restart the node, and again the leader tries to correct the index by
> >>> initiating replication, which causes load again, and the node goes
> >>> back to an unresponsive state.
> >>>
> >>> As soon as a node starts, the replication process is initiated for all
> >>> 130 cores. Is there any way we can control it, like one after the
> >>> other?
> >>>
> >>> Thanks,
> >>> Doss.
> >>>
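
P.S. On the last question in the original mail (starting the recovery of the 130 cores one after the other rather than all at once): one knob we plan to experiment with is coreLoadThreads in solr.xml, which as far as I understand limits how many cores a node loads in parallel at startup and so might spread out the recovery storm. A rough sketch only; the value 2 is a guess, and I am not sure how much of the parallel recovery in cloud mode it actually governs:

    <solr>
      <!-- assumption: load at most 2 cores in parallel when the node starts;
           the rest of our existing solr.xml (solrcloud section, etc.) is unchanged -->
      <int name="coreLoadThreads">2</int>
    </solr>

If anyone has tried this on a node with this many cores, feedback would be welcome.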