Hello,

The attachments did not come through (the top and iotop outputs). Could you also post your GC configuration and the resulting GC pause times?
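If GC logging is not already enabled, the pause data can be captured with flags along these lines (a sketch for Java 8, which you mention below; the log path is a placeholder):

    -Xloggc:/var/solr/logs/solr_gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -XX:+PrintTenuringDistribution

The "Total time for which application threads were stopped" lines in that log show the pause times directly.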
Thanks.

Deepak
"The greatness of a nation can be judged by the way its animals are treated" - Mahatma Gandhi
+91 73500 12833
deic...@gmail.com
LinkedIn: www.linkedin.com/in/deicool
"Plant a Tree, Go Green"
Make In India: http://www.makeinindia.com/home

On Sun, Jun 23, 2024 at 2:41 PM Saksham Gupta <saksham.gu...@indiamart.com.invalid> wrote:

> Hi All,
>
> Thanks for all the engagement.
>
> "Use tlog+pull replicas, they will improve the situation significantly": Do
> you mean using tlog/pull replicas to serve search requests and a separate
> set of replicas for indexing? We have tried this in the past, but it either
> requires a separate set of infra [which would double the cost], or all the
> traffic gets redirected onto the NRT replicas whenever the pull/tlog
> replicas run into any issue.
>
> "Can you please tell me about the hardware details (server type, CPU speed
> and type, disk speed and type) and GC configuration? Also please post
> results of top, iotop if you can?"
>
> *CPU Details:*
>
> model name: Intel(R) Xeon(R) CPU @ 2.80GHz
>
> cpu MHz: 2800.198
>
> *Disk speed:* The VMs used are GCP n2 machines with 16-core CPUs and 50 GB
> of RAM [an n2 machine with >= 16 vCPUs has an SSD with a maximum throughput
> of 1,200 MB/s, source:
> <https://cloud.google.com/compute/docs/disks/performance#pd-ssd>].
>
> *Top/iotop Result:* Please find attached the load and IO wait stats for a
> Solr server that faced this issue on 11 June 2024, from 11 AM to 2 PM, when
> load was consistently higher than usual; indexing was also higher than
> usual that day.
>
> For IO wait stats: io-wait-details-solr-node-11-June-2024
>
> For load stats: load-details-solr-node-11-June-2024
>
> "Are you having iowait, gc pauses, or something else? Do you commit often
> or in one big batch?"
>
> The iowait is <0.05.
>
> Commit is configured so that autoSoftCommit runs once every hour and
> autoCommit runs every 30 minutes [with openSearcher=false]. Please refer to
> the IO wait stats for the duration when the load exists. No correlation
> with IO wait was found!
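> For reference, this corresponds to roughly the following in solrconfig.xml
> (a sketch; times are in milliseconds):
>
>     <autoCommit>
>       <maxTime>1800000</maxTime>          <!-- 30 min; hard commit, flushes the tlog -->
>       <openSearcher>false</openSearcher>  <!-- no new searcher on hard commit -->
>     </autoCommit>
>
>     <autoSoftCommit>
>       <maxTime>3600000</maxTime>          <!-- 1 hour; soft commit opens a new searcher -->
>     </autoSoftCommit>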
> On Thu, Jun 20, 2024 at 5:15 PM matthew sporleder <msporle...@gmail.com>
> wrote:
>
>> Are you having iowait, gc pauses, or something else? Do you commit often
>> or in one big batch?
>>
>> > On Jun 20, 2024, at 12:26 AM, Saksham Gupta <
>> saksham.gu...@indiamart.com.invalid> wrote:
>> >
>> > Hi All,
>> >
>> > We have been facing load incidents in which a higher GC count and GC
>> time cause higher response times and timeouts.
>> >
>> > Solr Cloud Cluster Details
>> >
>> > We use Solr Cloud v8.10 [with Java 8 and G1 GC] with 8 shards, where
>> each shard sits on a single VM with 16 cores and 50 GB of RAM. Each shard
>> is ~28 GB and the Solr heap is 16 GB [the heap is used mainly for the
>> filter, document, and queryResults caches, each of size 512].
>> >
>> > Problem Details
>> >
>> > We pause indexing at 11 AM, during peak searching hours. Normally the
>> system remains stable during the peak hours, but when the document update
>> count on Solr is higher before peak hours [between 5.30 AM and 11 AM], we
>> face multiple load issues. The GC count and GC time increase, CPU is
>> consumed by GC itself, and the load and response time of the system rise.
>> To mitigate this, we recently increased the RAM on the servers [to 50 GB
>> from 42 GB previously] so as to reduce the IO wait from writing the Solr
>> index to memory multiple times. Taking it a step further, we also
>> increased the Solr heap from 12 to 16 GB [we also tried other combinations
>> like 14 GB, 15 GB, and 18 GB]. Although we saw some reduction in load
>> issues due to lower IO wait, the issue still recurs when heavier indexing
>> is done.
>> >
>> > We have explored a few options like expunge deletes, which may help
>> reduce the deleted-documents percentage, but that cannot be executed close
>> to peak hours, as it increases IO wait, which further spikes the load and
>> response time of Solr significantly.
>> >
>> > 1. Apart from changing the expunge deletes timing, is there another
>> option we can try to mitigate this problem?
>> >
>> > 2. Approximately 60 million documents are updated each day, i.e. ~30%
>> of the complete Solr index is modified each day, while serving ~20 million
>> search requests. We would appreciate any guidance on how to handle such
>> high indexing + searching traffic during peak hours.
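(A note for the archive: the "expunge deletes" operation mentioned above is a
commit issued with expungeDeletes=true, for example something like the
following sketch, with placeholder host and collection name:

    curl 'http://localhost:8983/solr/<collection>/update?commit=true&expungeDeletes=true'

It asks the merge policy to merge away segments with a high proportion of
deleted documents, which is why it is IO-heavy and best kept away from peak
hours.)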