tboeghk commented on PR #2783: URL: https://github.com/apache/solr/pull/2783#issuecomment-3104512660
In addition to the great summary of @ardatezcan1 above, here are my practical tips and real-world scenarios for running Solr in a high-rpm, low-to-medium-dataset environment (like e-commerce applications).

__Best practices using Solr in high rpm environments__

Before starting to optimize your Solr setup, make sure to have strong observability in place. In addition to the [Solr Prometheus and Grafana](https://solr.apache.org/guide/solr/latest/deployment-guide/monitoring-with-prometheus-and-grafana.html) setup, I strongly recommend setting up the [Node Exporter](https://github.com/prometheus/node_exporter) to gather and correlate machine metrics.

* __Use Solr in cloud mode__: Running Solr in cloud mode against a ZooKeeper ensemble is a prerequisite for the following best practices. Cloud mode enables easy addition and removal of Solr cluster nodes depending on the current traffic.
* __Sharding__: Request processing in Solr is a single-threaded operation. The larger your dataset, the more latency you add to request processing. The only (sustainable) way to make query processing a multi-threaded operation is to shard your index. Depending on your workload, you could simply run multiple Solr instances on the same machine; I recommend a single Solr instance per machine, though.
* __Sharding strategies__: If your query processing strategy uses [collapse (and expand or grouping)](https://solr.apache.org/guide/solr/latest/query-guide/collapse-and-expand-results.html), make sure to put all documents with the same grouping key on the same shard. Adjust the [document routing](https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#document-routing) and set `router.field` to your grouping key.
* __Indexing and optimization strategies__: Indexing into a live collection adds significant latency to your search requests. Each commit flushes the internal caches, and those caches keep Solr running fast. Avoid any unnecessary cache flushes!
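The routing and commit settings above can be sketched against the Collections and Config APIs. This is a minimal sketch; the collection name `products`, the grouping field `group_id`, and the concrete timings are hypothetical examples, not values from the discussion:

```shell
# Create a sharded collection that routes documents by the grouping key,
# so collapse/expand on group_id never has to cross shard boundaries.
curl "http://localhost:8983/solr/admin/collections?action=CREATE\
&name=products&numShards=4&replicationFactor=2\
&router.name=compositeId&router.field=group_id"

# Commit sparingly: hard-commit for durability without opening a new
# searcher, and soft-commit (which flushes the caches) only rarely.
curl -X POST "http://localhost:8983/solr/products/config" \
  -H 'Content-Type: application/json' -d '{
    "set-property": {
      "updateHandler.autoCommit.maxTime": 60000,
      "updateHandler.autoCommit.openSearcher": false,
      "updateHandler.autoSoftCommit.maxTime": 300000
    }
  }'
```

With `openSearcher=false`, hard commits persist data without discarding the warmed caches; only the (rare) soft commit makes new documents visible.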
* __Optimize your index__: Manually optimizing your index is generally not recommended, but it delivers the best performance, as deleted documents are pruned from the index.
* __Rotate collections__: For smaller to medium datasets it might be a good strategy to periodically index your data into a new collection instead of updating an existing one. That way, request caches stay warm for the lifetime of a collection, and a manual optimize is possible. Use [collection aliases](https://solr.apache.org/guide/solr/latest/deployment-guide/aliases.html) to switch clients to the new collection.
* __Use dedicated node setups__: In high-traffic environments, a separation of concerns becomes more important. Use dedicated node types and machine sizings/setups for optimal performance tailored to each machine's role.
  * __Indexer__: Used solely for indexing products. Set up as the [`TLOG`](https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#types-of-replicas) replica type. Must not be used for request processing: exclude `TLOG` replicas from request processing using the [`shards.preference`](https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-distributed-requests.html#shards-preference-parameter) parameter configured on your request handlers.
  * __Data__: Set up as a `PULL` replica. Replicates its index from the indexer nodes via SolrCloud. Using `TLOG` and `PULL` replicas keeps indexing load off the query-serving data nodes (unlike `NRT` replicas, where every replica indexes documents itself).
  * __Coordinator__: In sharded SolrCloud setups, these nodes coordinate the distributed request flow and assemble the final search result. This is a very CPU-intensive operation that is usually shared among the data nodes. Dedicated [coordinator nodes](https://solr.apache.org/guide/solr/latest/deployment-guide/node-roles.html#coordinator-role) move the compute overhead of coordinating distributed requests off the data nodes.
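The replica-type and alias mechanics above can be sketched as follows. The collection and alias names (`products_v2`, `products`) and host names are hypothetical; the API actions and parameters are the standard Collections API ones:

```shell
# Add a query-serving PULL replica for one shard of a rotated collection.
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA\
&collection=products_v2&shard=shard1&type=PULL"

# Keep queries off the TLOG indexer replicas by preferring PULL replicas.
# shards.preference can also be set as a default on the request handler.
curl "http://localhost:8983/solr/products_v2/select?q=*:*\
&shards.preference=replica.type:PULL"

# Atomically switch clients to the freshly built (and optimized) collection.
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS\
&name=products&collections=products_v2"

# Start a dedicated coordinator node (Solr 9.x node roles).
bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181 \
  -Dsolr.node.roles=coordinator:on,data:off
```

Because the alias swap is atomic, clients querying `products` never see a half-built collection.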
Adding coordinator nodes to a SolrCloud setup drops the resource usage on data nodes significantly. To make full use of coordinator nodes, direct all incoming request traffic to them.

* __JVM tuning__: I highly recommend running Solr on the _G1GC garbage collector_. Keep in mind the golden rule of leaving 50% of machine RAM to the OS disk cache on data and indexer nodes (i.e. size the heap at no more than half of RAM). As coordinator nodes are stateless, you can boost their performance significantly with the _ZGC garbage collector_: it slashes collection pauses from milliseconds to sub-millisecond.
* __Cloud setup__: Most SolrCloud setups will run in some kind of cloud environment. Here are some tips for setting up an elastic Solr cloud environment.
  * __Autoscaling__: Use a dedicated autoscaling group for each node type and each shard. Use tags to mark which instance should replicate which shard. Configure your heap settings dynamically and allow a wide range of instance types. Build a custom script to replicate data upon instance start, and use the Solr Collections API to [remove a node from the cluster](https://solr.apache.org/guide/solr/latest/deployment-guide/cluster-node-management.html#deletenode) during instance termination.
  * __Spot instances__: Coordinator and data nodes are great candidates for spot instances. This saves a significant share of your cloud spend.
  * __ARM instance types__: Utilize ARM instance types wherever possible. The Solr Docker image is also pre-built for ARM architectures. ARM CPUs offer the best bang for the buck and a more consistent response latency (as they are not power-managed).

If you need more information, or help compiling the whole of this into a single document, let me know!
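The JVM settings and the instance-termination hook described above can be sketched via `solr.in.sh` and the Collections API. The heap size and node name are hypothetical examples for a 32 GB machine:

```shell
# solr.in.sh on data/indexer nodes: G1GC, heap at most half of machine RAM
# so the other half stays available as OS disk cache.
SOLR_HEAP="16g"
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=100"

# solr.in.sh on stateless coordinator nodes: ZGC for minimal pauses.
# GC_TUNE="-XX:+UseZGC"

# Instance-termination hook: drain the node before the machine goes away.
curl "http://localhost:8983/solr/admin/collections?action=DELETENODE\
&node=ip-10-0-1-23.ec2.internal:8983_solr"
```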
