The things that jump out at me are:

1> Your filterCache. The autowarm count is very, very high. I usually start with about 16. This will be especially important if you open new searchers often, i.e. your soft commit interval or hard-commit-with-openSearcher-true. Essentially you're executing 900 filter queries just as though you'd sent them from outside Solr every time you open a searcher. I'd also make the cache substantially smaller; each entry sucks up approximately maxDoc/8 bytes. My fear is that you're spending a lot of resources on this cache that are starving your queries. What are your autocommit settings? And are you continuously indexing? If so, at what rate?
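To make that concrete, here's a rough sketch of what a trimmed-down filterCache (in the <query> section of solrconfig.xml) and explicit commit settings (in the <updateHandler> section) might look like. The numbers are illustrative starting points only, not tuned recommendations for your workload:

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="16"/>

  <autoCommit>
    <!-- hard commit for durability; does not open a new searcher -->
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <autoSoftCommit>
    <!-- soft commits control visibility; each one opens a new searcher
         and re-runs the autowarmCount filter queries -->
    <maxTime>300000</maxTime>
  </autoSoftCommit>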
2> A 2-4 second latency is very, very large for a corpus this size; something is (probably) misconfigured.

3> Document cache disabled. If you can absolutely guarantee that stored fields are never returned you're probably OK. The purpose of the documentCache is to prevent components of the _same_ query from re-reading and decompressing minimum 16K blocks for each returned document. You're right that if there are no stored fields this is probably useless; OTOH if no documents are fetched it won't be used either, since it's not pre-allocated.

4> On a quick glance your boosts don't appear too bad; I'd be surprised if this were the root cause.

5> I think you're oversharded unless you expect to have much larger numbers of documents. My rule of thumb is to expect about 50M docs/shard given reasonable machines, YMMV of course. If that intuition is right, you could get much greater throughput with fewer shards.

6> Here's what I'd do. I'm guessing you have some kind of load-testing tool, jMeter or the like (you _must_ use "real" queries BTW). Go ahead and set up a test run on a node, changing one thing at a time. For instance, you could remove all of the boosting and test that (a stripped-down handler sketch follows these points). Try with no indexing going on, etc.

7> Throw a profiler at it to find out where you're spending time. My suggestions are guesswork at some level; it may be easier to try some things than to get a profiler running on it, but...

8> You could work with an isolated machine, even a local box. Just create a one-shard collection, then copy the index from _one_ of your shards locally to the right place and beat it up.
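To make 6> concrete, a boost-free variant of the /select/foo handler from the mail below might look something like this (field names copied from that config; the handler name is just a placeholder, and this is a test sketch rather than a drop-in replacement):

  <requestHandler name="/select/foo_noboost" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="rows">10</str>
      <str name="fl">id</str>
      <str name="qf">fieldA^0.99 fieldB^0.99 fieldC^0.99 fieldD^0.99 fieldE^0.99</str>
      <str name="tie">0.99</str>
      <str name="sow">true</str>
      <!-- pf, the per-language qf aliases, the bq/boost appends and the
           elevator last-component are deliberately omitted; add them back
           one at a time and re-run the same query set to see what each costs -->
    </lst>
    <lst name="invariants">
      <str name="q">{!type=edismax v=$qq}</str>
    </lst>
  </requestHandler>

Run the identical query log against both handlers, first with indexing paused and then with indexing running, and compare the latency numbers.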
Good luck!
Erick

> On Mar 19, 2019, at 3:38 PM, Ash Ramesh <ash...@canva.com> wrote:
>
> Hi everybody,
>
> My team runs a Solr cluster which has very low QPS throughput. I have been
> going through the different configurations in our setup, and think that
> it's probably the way we have defined our request handlers that is causing
> the slowness.
>
> Details of our cluster are below the fold.
>
> *Questions:*
>
> 1. Obviously we have a set of 'expensive' boosts here. Are there any
>    inherent anti-patterns obvious in the request handler?
> 2. Is it normal for such a request handler to max out at around 13 QPS
>    before latency starts hitting 2-4 seconds?
> 3. Have we maybe architected our cluster incorrectly?
> 4. Are there any patterns we should adopt to increase throughput?
>
> Thank you so much for taking the time to read this email. We would really
> appreciate any feedback. We are happy to provide more details about our
> cluster if needed.
>
> Regards,
> Ash
>
> *Information about our Cluster:*
>
> - *Solr Version:* 7.4.0
> - *Architecture:* TLOG/PULL - 8 shards (default shard hashing)
>   - Doc count: ~50 million
>   - TLOG: EC2 machines hosting TLOGs have all 8 shards, approximately 12G index total
>   - PULL: EC2 machines host 2 shards each. There are 4 ASGs such that each
>     ASG hosts one of the shard combinations [shard1, shard2], [shard3, shard4],
>     [shard5, shard6], [shard7, shard8]
>   - We scale up on CPU utilisation
> - *Schema:* No stored fields (except for `id`)
> - *Indexing:* We use the SolrJ ZooKeeper client to talk directly to TLOGs
>   to update (fully replace) documents
> - *Deleted docs:* Between 10-25% depending on when the merge policy was last executed
> - *Request serving:* PULL ASGs are wrapped in an ELB, and we use the SolrJ
>   HTTP client to make requests
>   - All read requests are sent with
>     "shards.preference=replica.location:local,replica.type:PULL" in an
>     attempt to direct all traffic to PULL nodes
> - *Average QPS* per full copy of the index (PULL nodes of shard1-shard8):
>   *13 queries per second*
> - *Heap Size PULL:* 15G
>   - Index is fully memory mapped with extra RAM to spare on all PULL machines
> - *Solr Caches:*
>   - Document cache: disabled - no stored fields, seems pointless
>   - Query cache: disabled - too many different queries, no reason to use this
>   - Filter cache: 1600 in size (900 autowarm) - we have a set of well-defined
>     filter queries; we are thinking of increasing this since the hit rate is 0.86
>
> *Example Request Handler (obfuscated field names and boost values)*
>
> <requestHandler name="/select/foo" class="solr.SearchHandler">
>
>   <lst name="defaults">
>
>     <!-- START Business specific variables -->
>     <str name="lang">en</str>
>     <!-- END Business specific variables -->
>
>     <str name="defType">edismax</str>
>     <str name="rows">10</str>
>     <str name="fl">id</str>
>     <str name="uf">* _query_</str>
>
>     <str name="qf">fieldA^0.99 fieldB^0.99 fieldC^0.99 fieldD^0.99 fieldE^0.99</str>
>     <str name="f.fieldA.qf">fieldA_$${lang}</str>
>     <str name="f.fieldB.qf">fieldB_$${lang}</str>
>     <str name="f.fieldC.qf">fieldC_$${lang}</str>
>     <str name="f.fieldD.qf">fieldD_$${lang}</str>
>     <str name="f.textContent.qf">textContent_$${lang}</str>
>
>     <str name="ps">2</str>
>     <str name="tie">0.99</str>
>     <str name="sow">true</str>
>
>     <!-- f.xyz.qf aliases don't work for pf -->
>     <str name="pf">fieldA_$${lang}^0.99 fieldB_$${lang}^0.99</str>
>   </lst>
>
>   <lst name="invariants">
>     <str name="q">{!type=edismax v=$qq}</str>
>   </lst>
>
>   <lst name="appends">
>     <str name="bq">{!edismax qf=fieldA^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
>     <str name="bq">{!edismax qf=fieldB^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
>     <str name="bq">{!edismax qf=fieldC^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
>     <str name="bq">{!edismax qf=fieldD^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
>
>     <str name="boost">{!edismax qf=fieldA^0.99 fieldB^0.99 fieldC^0.99 fieldD^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
>
>     <str name="bq">{!func}mul(termfreq(docBoostFieldB,$qq),100)</str>
>     <str name="boost">if(termfreq(docBoostFieldB,$qq),1,def(docBoostFieldA,1))</str>
>   </lst>
>
>   <arr name="last-components">
>     <str>elevator</str>
>   </arr>
>
> </requestHandler>
>
> *Notes:*
>
> - We have a data science team that feeds click-through data back into the
>   boostFields to re-order results for popular queries
> - We do sorting on 'score DESC dateSubmitted DESC'
> - We use the 'elevator' component quite heavily - e.g. 'elevateIds=A,B,C'
> - We have some localized fields - thus we do aliasing in the request handler
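For reference, the 'elevator' wired into last-components in the handler above is Solr's QueryElevationComponent. A minimal sketch of how it is typically registered in solrconfig.xml (the queryFieldType and config-file values here are illustrative, since the actual configuration isn't shown):

  <searchComponent name="elevator" class="solr.QueryElevationComponent">
    <!-- field type used to analyze the incoming query text before matching entries -->
    <str name="queryFieldType">string</str>
    <!-- file in the conf/ directory listing queries and the doc ids to pin for them -->
    <str name="config-file">elevate.xml</str>
  </searchComponent>

Passing elevateIds=A,B,C at request time, as described in the notes, overrides the ids configured in elevate.xml for that query.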