Hi everybody,

My team runs a Solr cluster that has very low QPS throughput. I have been going through the different configurations in our setup, and I think the way we have defined our request handlers is probably what is causing the slowness.
Details of our cluster are below the fold.

*Questions:*

1. Obviously we have a set of 'expensive' boosts here. Are there any inherent anti-patterns obvious in the request handler?
2. Is it normal for such a request handler to max out at around 13 QPS before latency starts hitting 2-4 seconds?
3. Have we maybe architected our cluster incorrectly?
4. Are there any patterns we should adopt to increase throughput?

Thank you so much for taking the time to read this email. We would really appreciate any feedback. We are happy to provide more details about our cluster if needed.

Regards,
Ash

*Information about our Cluster:*

- *Solr Version:* 7.4.0
- *Architecture:* TLOG/PULL
  - 8 shards (default shard hashing)
  - Doc count: approximately 50 million
  - TLOG - EC2 machines hosting TLOG replicas have all 8 shards. Approximately 12G index in total.
  - PULL - EC2 machines host 2 shards each. There are 4 ASGs, and each ASG hosts one of the shard combinations [shard1, shard2], [shard3, shard4], [shard5, shard6], [shard7, shard8].
    - We scale up on CPU utilisation.
- Schema: No stored fields (except for `id`)
- Indexing: We use the SolrJ ZooKeeper client to talk directly to the TLOG replicas to update (fully replace) documents.
- Deleted docs: Between 10% and 25%, depending on when the merge policy was last executed
- Request serving: The PULL ASGs sit behind an ELB, and we use the SolrJ HTTP client to make requests.
  - All read requests are sent with `shards.preference=replica.location:local,replica.type:PULL` in an attempt to direct all traffic to PULL nodes.
- *Average QPS* per full copy of the index (PULL nodes of shard1-shard8): *13 queries per second*
- *Heap Size PULL:* 15G
  - The index is fully memory mapped, with extra RAM to spare on all PULL machines.
- *Solr Caches:*
  - Document Cache: Disabled. We have no stored fields, so it seems pointless.
  - Query Cache: Disabled. There are too many distinct queries, so we see no reason to use it.
  - Filter Cache: 1600 in size (900 autowarm). We have a set of well-defined filter queries, and we are thinking of increasing this since the hit rate is 0.86.

*Example Request Handler (Obfuscated field names and boost values)*

    <requestHandler name="/select/foo" class="solr.SearchHandler">
      <lst name="defaults">
        <!-- START Business specific variables -->
        <str name="lang">en</str>
        <!-- END Business specific variables -->
        <str name="defType">edismax</str>
        <str name="rows">10</str>
        <str name="fl">id</str>
        <str name="uf">* _query_</str>
        <str name="qf">fieldA^0.99 fieldB^0.99 fieldC^0.99 fieldD^0.99 fieldE^0.99</str>
        <str name="f.fieldA.qf">fieldA_$${lang}</str>
        <str name="f.fieldB.qf">fieldB_$${lang}</str>
        <str name="f.fieldC.qf">fieldC_$${lang}</str>
        <str name="f.fieldD.qf">fieldD_$${lang}</str>
        <str name="f.textContent.qf">textContent_$${lang}</str>
        <str name="ps">2</str>
        <str name="tie">0.99</str>
        <str name="sow">true</str>
        <!-- f.xyz.qf aliases don't work for pf -->
        <str name="pf">fieldA_$${lang}^0.99 fieldB_$${lang}^0.99</str>
      </lst>
      <lst name="invariants">
        <str name="q">{!type=edismax v=$qq}</str>
      </lst>
      <lst name="appends">
        <str name="bq">{!edismax qf=fieldA^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
        <str name="bq">{!edismax qf=fieldB^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
        <str name="bq">{!edismax qf=fieldC^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
        <str name="bq">{!edismax qf=fieldD^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
        <str name="boost">{!edismax qf=fieldA^0.99 fieldB^0.99 fieldC^0.99 fieldD^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
        <str name="bq">{!func}mul(termfreq(docBoostFieldB,$qq),100)</str>
        <str name="boost">if(termfreq(docBoostFieldB,$qq),1,def(docBoostFieldA,1))</str>
      </lst>
      <arr name="last-components">
        <str>elevator</str>
      </arr>
    </requestHandler>

*Notes:*

- We have a data science team that feeds click-through data back into the boost fields to re-order results for popular queries.
- We sort on 'score DESC dateSubmitted DESC'.
- We use the 'elevator' component quite heavily, e.g. 'elevateIds=A,B,C'.
- We have some localized fields, which is why we do aliasing in the request handler.
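To make the request shape concrete, here is a rough sketch of how the parameters described above (`qq`, the sort, `elevateIds` and `shards.preference`) might be combined into a single read request against `/select/foo`. It is only an illustration: our real client uses SolrJ rather than Python, the host and collection names are made up, the helper names `build_select_params` and `build_select_url` are hypothetical, the sort is written in Solr's comma-separated syntax, and passing `lang` explicitly on each request is an assumption.

```python
from urllib.parse import urlencode, parse_qs


def build_select_params(user_query, lang="en", elevate_ids=()):
    """Assemble request parameters for the /select/foo handler (hypothetical helper)."""
    params = {
        # The handler's invariant q={!type=edismax v=$qq} reads the user text from qq.
        "qq": user_query,
        # Assumption: overrides the handler default used by the f.<field>.qf aliases.
        "lang": lang,
        "sort": "score desc,dateSubmitted desc",
        # Ask Solr to prefer local PULL replicas when fanning out to shards.
        "shards.preference": "replica.location:local,replica.type:PULL",
    }
    if elevate_ids:
        # Consumed by the 'elevator' last-component.
        params["elevateIds"] = ",".join(elevate_ids)
    return params


def build_select_url(base_url, params):
    """Return the full URL with URL-encoded query parameters (hypothetical helper)."""
    return f"{base_url}/select/foo?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical ELB host and collection name.
    base = "http://solr-pull.example.internal:8983/solr/collection1"
    url = build_select_url(base, build_select_params("blue poster", elevate_ids=["A", "B", "C"]))
    print(url)
    # Round-trip the query string to show what Solr would receive.
    print(parse_qs(url.split("?", 1)[1]))
```

Note that every such request triggers the main edismax query plus the four `bq` edismax sub-queries, the multiplicative edismax `boost`, and the two function boosts from the `appends` section, on each of the 8 shards.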