Thanks Erick for the explanation. The sum of all our index sizes is about
138 GB, and only 2 indexes are > 19 GB, so it's time to scale up :-). If I
follow your point, with replication factor 3 on 3 nodes each node holds all
of that 138 GB, against the roughly 76 GB of RAM left for the OS cache after
the 14 GB heap. Adding new hardware will take at least a couple of days;
until then, is there any option to control the replication method?
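
While reading the ref guide I came across maxWriteMBPerSec on the
replication handler, which as I understand it throttles how fast index
files are copied during replication. Would something like the untested
sketch below help limit recovery traffic in the meantime? (The 20 MB/s
value is only a guess on my part.)

    <!-- solrconfig.xml: untested sketch, throttle value is a guess -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="defaults">
        <str name="maxWriteMBPerSec">20</str>
      </lst>
    </requestHandler>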

Thanks,
Doss.

On Thu, Sep 5, 2019 at 6:12 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> You say you have three nodes, 130 replicas and a replication factor of 3,
> so you have 130 cores/node. At least one of those cores has a 20G index,
> right?
>
> What is the sum of all the indexes on a single physical machine?
>
> I think your system is under-provisioned and that you’ve been riding at
> the edge of instability for quite some time, and have added enough more
> docs that you finally reached a tipping point. But that’s largely
> speculation.
>
> So adding more heap may help. But Real Soon Now you need to think about
> adding more hardware and moving some of your replicas to that new
> hardware.
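>
> When you do get the new machine, the MOVEREPLICA Collections API call is
> the usual way to shift a replica in 7.x, if I remember right. A sketch
> (the collection, replica and node names here are made up, substitute
> your own):
>
>     http://localhost:8983/solr/admin/collections?action=MOVEREPLICA
>       &collection=myCollection&replica=core_node6
>       &targetNode=newhost:8983_solr
>
> It adds a copy on the target node and then drops the old replica.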
>
> Again, this is speculation. But when systems are running with an
> _aggregate_ index size that is many multiples of the available memory
> (total physical memory), it’s a red flag. I’m guessing a bit since I
> don’t know the aggregate for all replicas…
>
> Best,
> Erick
>
> > On Sep 5, 2019, at 8:08 AM, Doss <itsmed...@gmail.com> wrote:
> >
> > @Jorn We are adding a few more ZooKeeper nodes soon. Thanks.
> >
> > @Erick, sorry I couldn't understand it clearly. We have 90 GB RAM per
> > node, out of which 14 GB is assigned to the heap. Do you mean we have
> > to allocate more heap, or that we need to add more physical RAM?
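> >
> > For reference, we set the heap via SOLR_HEAP in solr.in.sh (assuming
> > the standard install layout), i.e. a line like:
> >
> >     SOLR_HEAP="14g"
> >
> > so I take it raising the heap would just mean changing that value?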
> >
> > This system ran for 8 to 9 months without any major issues; only in
> > recent times have we been facing so many of these incidents.
> >
> > On Thu, Sep 5, 2019 at 5:20 PM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> If I'm reading this correctly, you have a huge amount of index in not
> >> much memory. You only have 14g allocated across 130 replicas, at least
> >> one of which has a 20g index. You don't need as much memory as your
> >> aggregate index size, but this system feels severely under-provisioned.
> >> I suspect that's the root of your instability.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Sep 5, 2019, 07:08 Doss <itsmed...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> We are using a 3-node Solr (7.0.1) cloud setup with a 1-node ZooKeeper
> >>> ensemble. Each system has 16 CPUs, 90 GB RAM (14 GB heap), and 130
> >>> cores (3 NRT replicas) with index sizes ranging from 700 MB to 20 GB.
> >>>
> >>> autoCommit - once every 10 minutes
> >>> softCommit - once every 30 seconds
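> >>>
> >>> In solrconfig.xml terms that should correspond to roughly the
> >>> following (times in milliseconds; openSearcher=false on the hard
> >>> commit is what we intend, please correct me if this looks wrong):
> >>>
> >>>     <autoCommit>
> >>>       <maxTime>600000</maxTime>
> >>>       <openSearcher>false</openSearcher>
> >>>     </autoCommit>
> >>>     <autoSoftCommit>
> >>>       <maxTime>30000</maxTime>
> >>>     </autoSoftCommit>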
> >>>
> >>> At peak time, if a shard goes into recovery mode, many other shards
> >>> also go into recovery within a few minutes, which creates huge load
> >>> (200+ load average) and Solr becomes unresponsive. To fix this we
> >>> restart the node; the leader then tries to correct the index by
> >>> initiating replication, which causes heavy load again, and the node
> >>> goes back into an unresponsive state.
> >>>
> >>> As soon as a node starts, the replication process is initiated for
> >>> all 130 cores. Is there any way we can control this, e.g., run the
> >>> recoveries one after the other?
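> >>>
> >>> One thing I noticed is the coreLoadThreads setting in solr.xml, which
> >>> is documented as the number of threads used to load cores in
> >>> parallel. Would lowering it, along the lines of this untested sketch,
> >>> at least stagger the recoveries at startup? (The value 1 is only for
> >>> illustration.)
> >>>
> >>>     <!-- solr.xml: untested sketch -->
> >>>     <solr>
> >>>       <int name="coreLoadThreads">1</int>
> >>>     </solr>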
> >>>
> >>> Thanks,
> >>> Doss.
> >>>
> >>
>
>
