Problem understanding why QPS is so low

Ash Ramesh Tue, 19 Mar 2019 15:38:49 -0700

Hi everybody,

My team runs a Solr cluster that has very low QPS throughput. I have been
going through the different configurations in our setup, and I think that
the way we have defined our request handlers is probably what is causing
the slowness.


Details of our cluster are below the fold.

*Questions:*

   1. Obviously we have a set of 'expensive' boosts here. Are there any
   inherent anti-patterns obvious in the request handler?
   2. Is it normal for such a request handler to max out at around 13 QPS
   before latency starts hitting 2-4 seconds?
   3. Have we maybe architected our cluster incorrectly?
   4. Are there any patterns we should adopt to increase throughput?


Thank you so much for taking the time to read this email. We would really
appreciate any feedback. We are happy to provide more details about our
cluster if needed.

Regards,

Ash

*Information about our Cluster:*

   - *Solr Version: *7.4.0
   - *Architecture: *TLOG/PULL - 8 Shards (Default shard hashing)
      - Doc Count: ~50 million
      - TLOG - EC2 machines hosting TLOGs have all 8 shards. Approximately
      12G index total
      - PULL - Each EC2 machine hosts 2 shards. There are 4 ASGs, and each
      ASG hosts one of the shard combinations - [shard1, shard2], [shard3,
      shard4], [shard5, shard6], [shard7, shard8]
         - We scale up on CPU utilisation
      - Schema: No stored fields (except for `id`)
      - Indexing: Use the SolrJ Zookeeper Client to talk directly to TLOGs
      to update (fully replace) documents
         - Deleted docs: Between 10-25% depending on when the merge policy
         was last executed
      - Request Serving: PULL ASGs sit behind an ELB, and we use the SolrJ
      HTTP Client to make requests.
         - All read requests are sent with
         'shards.preference=replica.location:local,replica.type:PULL' in an
         attempt to direct all traffic to PULL nodes.
         - *Average QPS* per full copy of the index (PULL nodes of
         shard1-shard8): *13 queries per second*
   - *Heap Size PULL: *15G
      - Index is fully memory mapped with extra RAM to spare on all PULL
      machines
   - *Solr Caches:*
      - Document Cache: Disabled - No stored fields, seems pointless
      - Query Cache: Disabled - too many different queries, so there is no
      reason to use this
      - Filter Cache: 1600 in size (900 autowarm) - we have a set of
      well-defined filter queries; we are thinking of increasing this since
      the hit rate is 0.86

*Example Request Handler (Obfuscated field names and boost values)*


<requestHandler name="/select/foo" class="solr.SearchHandler">

    <lst name="defaults">

      <!-- START Business specific variables -->
      <str name="lang">en</str>
      <!-- END Business specific variables -->

      <str name="defType">edismax</str>
      <str name="rows">10</str>
      <str name="fl">id</str>
      <str name="uf">* _query_</str>

      <str name="qf">fieldA^0.99 fieldB^0.99 fieldC^0.99 fieldD^0.99 fieldE^0.99</str>
      <str name="f.fieldA.qf">fieldA_$${lang}</str>
      <str name="f.fieldB.qf">fieldB_$${lang}</str>
      <str name="f.fieldC.qf">fieldC_$${lang}</str>
      <str name="f.fieldD.qf">fieldD_$${lang}</str>
      <str name="f.textContent.qf">textContent_$${lang}</str>

      <str name="ps">2</str>
      <str name="tie">0.99</str>
      <str name="sow">true</str>

      <!-- f.xyz.qf aliases don't work for pf -->
      <str name="pf">fieldA_$${lang}^0.99 fieldB_$${lang}^0.99</str>
    </lst>

    <lst name="invariants">
      <str name="q">{!type=edismax v=$qq}</str>
    </lst>

    <lst name="appends">
      <str name="bq">{!edismax qf=fieldA^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
      <str name="bq">{!edismax qf=fieldB^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
      <str name="bq">{!edismax qf=fieldC^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>
      <str name="bq">{!edismax qf=fieldD^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>

      <str name="boost">{!edismax qf=fieldA^0.99 fieldB^0.99 fieldC^0.99 fieldD^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}</str>

      <str name="bq">{!func}mul(termfreq(docBoostFieldB,$qq),100)</str>
      <str name="boost">if(termfreq(docBoostFieldB,$qq),1,def(docBoostFieldA,1))</str>
    </lst>

    <arr name="last-components">
      <str>elevator</str>
    </arr>

</requestHandler>
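
To make the cost of this handler easier to reason about, here is a rough Python sketch of how I understand Solr combines the three sections with the incoming request: request values replace `defaults`, `appends` values are added on top, and `invariants` replace whatever the client sent. The `merge_params` helper, the sample request and the `red shoes` query are made up for illustration, and only a subset of the defaults is listed. The final count is just a string match on `v=$qq`, used as a proxy for how many times the user query is handed to edismax per request.

```python
def to_multimap(pairs):
    """Group (name, value) pairs into {name: [values, ...]}, keeping order."""
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name, []).append(value)
    return grouped


def merge_params(request, defaults, appends, invariants):
    """Approximate the effective params of one request (illustrative helper)."""
    req, dfl, app, inv = (
        to_multimap(p) for p in (request, defaults, appends, invariants)
    )
    effective = dict(dfl)
    effective.update(req)            # request values replace defaults
    for name, values in app.items():
        effective[name] = effective.get(name, []) + values  # appends are added
    effective.update(inv)            # invariants replace client values
    return effective


# Subset of the handler above; values copied from the config.
DEFAULTS = [("defType", "edismax"), ("rows", "10"), ("fl", "id")]
INVARIANTS = [("q", "{!type=edismax v=$qq}")]
APPENDS = [
    ("bq", f'{{!edismax qf={field}^0.99 mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}}')
    for field in ("fieldA", "fieldB", "fieldC", "fieldD")
] + [
    ("boost", '{!edismax qf=fieldA^0.99 fieldB^0.99 fieldC^0.99 fieldD^0.99 '
              'mm=100% bq="" boost="" pf="" tie=1.00 v=$qq}'),
    ("bq", "{!func}mul(termfreq(docBoostFieldB,$qq),100)"),
    ("boost", "if(termfreq(docBoostFieldB,$qq),1,def(docBoostFieldA,1))"),
]

# Hypothetical client request: q is overridden, rows replaces the default.
REQUEST = [("q", "*:*"), ("qq", "red shoes"), ("rows", "20")]

effective = merge_params(REQUEST, DEFAULTS, APPENDS, INVARIANTS)

# Count the values that pass the user query to edismax via v=$qq.
edismax_parses = sum(
    "v=$qq" in value
    for name in ("q", "bq", "boost")
    for value in effective.get(name, [])
)
print(edismax_parses)
```

With this handler the count covers the main `q`, four of the five `bq` values and one of the two `boost` values; the remaining `bq` and `boost` are function queries over `termfreq`.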

*Notes:*

   - We have a data science team that feeds back click through data to the
   boostFields to re-order results for popular queries
   - We do sorting on 'score DESC dateSubmitted DESC'
   - We use the 'elevator' component quite heavily - e.g.
   'elevateIds=A,B,C'
   - We have some localized fields - thus we do aliasing in the request
   handler
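
For reference, a typical read request against this handler can be sketched as below. This is only an illustration of the parameters described above, not our actual client code (which uses SolrJ): the base URL, the collection name and the `build_select_url` helper are hypothetical, and the sort is written in the comma-separated form `score desc, dateSubmitted desc`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical ELB host and collection name, for illustration only.
BASE_URL = "http://solr-elb.example.internal:8983/solr/mycollection"


def build_select_url(user_query, elevate_ids=(), base_url=BASE_URL):
    """Build a read request URL for the /select/foo handler."""
    params = [
        ("qq", user_query),  # the handler's invariant q reads v=$qq
        ("shards.preference", "replica.location:local,replica.type:PULL"),
        ("sort", "score desc, dateSubmitted desc"),
    ]
    if elevate_ids:
        params.append(("elevateIds", ",".join(elevate_ids)))
    return base_url + "/select/foo?" + urlencode(params)


url = build_select_url("red shoes", elevate_ids=["A", "B", "C"])
sent = parse_qs(urlsplit(url).query)  # decode again to inspect what is sent
print(url)
```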
