Hi All,

Thanks for all the engagement.

Use tlog+pull replicas, they will improve the situation significantly: Do
you mean using TLOG/PULL replicas to serve search requests and a separate
set of replicas for indexing? We have tried this in the past, but it either
requires a separate set of infra [which would double the cost], or all the
traffic gets redirected onto the NRT replicas whenever any issue is faced
on the PULL/TLOG replicas.
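
For clarity, below is roughly what I understood by that suggestion, as a
solrconfig.xml sketch; the handler name and the idea of pinning search
traffic via shards.preference are my reading of it, not our current setup:

    <!-- Sketch: prefer PULL replicas, then TLOG, for search requests.
         Solr still falls back to NRT replicas when none of the preferred
         types are reachable, which is exactly the failover behaviour we
         are worried about. -->
    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="shards.preference">replica.type:PULL,replica.type:TLOG</str>
      </lst>
    </requestHandler>

The same parameter can also be sent per request
[&shards.preference=replica.type:PULL], so please correct me if a different
topology was intended.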

Can you please tell me about the hardware details (Server type, CPU speed
and type, Disk Speed and type) and GC configuration? Also please post
results of top, iotop if you can?

*CPU Details:*

model name: Intel(R) Xeon(R) CPU @ 2.80GHz

cpu MHz         : 2800.198

*Disk speed:* The VMs used are GCP n2 machines with a 16-core CPU and 50 GB
of RAM [an n2 machine with >= 16 vCPUs has persistent SSD with a maximum
throughput of 1,200 MB/s, source
<https://cloud.google.com/compute/docs/disks/performance#pd-ssd>]

*Top/iotop results:* Please find attached the load and IO wait stats for a
Solr server that faced this issue on 11 June 2024, from 11 AM to 2 PM, when
load was consistently higher than usual; indexing was also higher than usual
that day.

For IO wait stats: io-wait-details-solr-node-11-June-2024

For Load stats: load-details-solr-node-11-June-2024

Are you having iowait, gc pauses, or something else? Do you commit often or
in one big batch?

The iowait is <0.05.

Commits are configured so that autoSoftCommit runs once every hour and
autoCommit runs every 30 minutes [with openSearcher=false]. Please refer to
the attached IO wait stats for the window when the load was high; we found
no correlation with IO wait.
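
For reference, the commit settings amount to something like the following in
solrconfig.xml (a sketch; the maxTime values simply express the intervals
mentioned above):

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Hard commit every 30 min; flushes to disk but does not open a
           new searcher, so caches are not invalidated. -->
      <autoCommit>
        <maxTime>1800000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- Soft commit every hour; opens a new searcher so changes become
           visible and the caches are rebuilt. -->
      <autoSoftCommit>
        <maxTime>3600000</maxTime>
      </autoSoftCommit>
    </updateHandler>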

On Thu, Jun 20, 2024 at 5:15 PM matthew sporleder <msporle...@gmail.com>
wrote:

> Are you having iowait, gc pauses, or something else? Do you commit often
> or in one big batch?
>
> > On Jun 20, 2024, at 12:26 AM, Saksham Gupta
> > <saksham.gu...@indiamart.com.invalid> wrote:
> >
> > Hi All,
> >
> > We have been facing extra load incidents due to higher gc count and gc
> > time causing higher response time and timeouts.
> >
> > Solr Cloud Cluster Details
> >
> > We use solr cloud v8.10 [with java 8 and G1 GC] with 8 shards where each
> > shard is present on a single vm of 16 cores and 50 gb RAM. Size of each
> > shard is ~28 gb and heap of solr is 16 gb [heap utilization only for
> > filter, document, and queryResults cache each of size 512].
> >
> > Problem Details
> >
> > We pause indexing at 11 AM during peak searching hours. Normally the
> > system remains stable during the peak hours, but when documents update
> > count on solr is higher before peak hours [b/w from 5.30 AM to 11 AM],
> > we face multiple load issues. The gc count and gc time increases and
> > cpu is consumed in gc itself thereby increasing load and response time
> > of the system. To mitigate this, we recently increased the ram on the
> > servers [to 50 gb from 42 gb previously], as to reduce the io wait for
> > writing solr index on memory multiple times. Taking a step further, we
> > also increased the heap of solr from 12 to 16 gb [also tried other
> > combinations like 14 gb, 15 gb, 18 gb], although we found some
> > reduction in load issues due to lower io wait, still the issue recurs
> > when higher indexing is done.
> >
> > We have explored a few options like expunge deletes, which may help
> > reduce the deleted documents percentage, but that cannot be executed
> > close to peak hours, as it increases io wait which further spikes load
> > and response time of solr significantly.
> >
> >   1. Apart from changing the expunge deletes timing, is there another
> >   option which we can try to mitigate this problem?
> >   2. Approximately 60 million documents are updated each day i.e. ~30%
> >   of the complete solr index is modified each day while serving ~20
> >   million search requests. Would appreciate any knowledge upon how to
> >   handle such high indexing + searching traffic during peak hours.
>
