Hi Ere,

Thanks for your advice! I'm aware of the performance problems with deep paging, but unfortunately that is not the case here: the rows parameter is always 24, and subsequent pages are hardly ever requested, from what I see in the logs.
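For context, a typical query is roughly shaped like the sketch below. This is only an illustration: it assumes SolrJ (we may call Solr differently), and the ZooKeeper host, collection name, field names and sort function are placeholders (the real sort expression with IF functions is more complex).

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QueryShapeExample {
    public static void main(String[] args) throws Exception {
        // ZooKeeper host and collection name are placeholders.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk-host:2181"), Optional.empty()).build()) {
            client.setDefaultCollection("products");

            SolrQuery q = new SolrQuery("red running shoes"); // typically 2-3 words in q
            q.setRows(24);                                    // rows is always 24
            q.setStart(0);                                    // first page only, no deep paging
            q.addFilterQuery("category:shoes");               // fq #1 (placeholder)
            q.addFilterQuery("in_stock:true");                // fq #2 (placeholder)
            // Simplified stand-in for the real, more complex sort with IF functions.
            q.set("sort", "if(exists(boost),boost,0) desc, score desc");

            QueryResponse rsp = client.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}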


On 29.10.18 11:19, Ere Maijala wrote:
Hi Sofiya,

You've already received a lot of ideas, but I think this wasn't mentioned yet: you didn't specify the number of rows your queries fetch or whether you're using deep paging in the queries. Both can be real performance killers in a sharded index, because a large set of records has to be fetched from all shards. This consumes a relatively high amount of memory, and even if the servers can handle a certain number of these queries simultaneously, you'd run into garbage collection trouble as more queries are served. So just one more thing to be aware of!
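To illustrate the cost with made-up numbers (a minimal SolrJ-style sketch, only to show the shape of such a request):

import org.apache.solr.client.solrj.SolrQuery;

public class DeepPagingCost {
    public static void main(String[] args) {
        // Hypothetical deeply paged request, hundreds of pages in at 24 rows per page.
        SolrQuery q = new SolrQuery("some user query");
        q.setStart(10000);  // deep offset
        q.setRows(24);
        // In a 4-shard collection, each shard has to score, sort and return its
        // top start+rows = 10024 ids to the coordinating node, which then merges
        // ~40000 entries before fetching the 24 stored documents it actually returns.
        System.out.println(q);
    }
}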

Regards,
Ere

Sofiya Strochyk wrote on 26.10.2018 at 18.55:
Hi everyone,

We have a SolrCloud setup with the following configuration:

  * 4 nodes (3x128GB RAM Intel Xeon E5-1650v2, 1x64GB RAM Intel Xeon
    E5-1650v2, 12 cores, with SSDs)
  * One collection, 4 shards, each has only a single replica (so 4
    replicas in total), using compositeId router
  * Total index size is about 150M documents/320GB, so about 40M/80GB
    per node
  * Zookeeper is on a separate server
  * Documents consist of about 20 fields (most of them are both stored
    and indexed), average document size is about 2kB
  * Queries are mostly 2-3 words in the q field, with 2 fq parameters,
    with complex sort expression (containing IF functions)
  * We don't use faceting for performance reasons but will need to add
    it in the future
  * The majority of the documents are reindexed 2 times/day, as fast as
    SOLR allows, in batches of 1000-10000 docs. Some of the documents
    are also deleted (by id, not by query); a rough sketch of this
    indexing pattern follows this list
  * autoCommit is set to maxTime of 1 minute with openSearcher=false and
    autoSoftCommit maxTime is 30 minutes with openSearcher=true. Commits
    from clients are ignored.
  * Heap size is set to 8GB.
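For concreteness, the indexing side looks roughly like the sketch below. Again this is only a sketch: it assumes SolrJ, the ZooKeeper host, collection name and field names are placeholders, and only the batching and the absence of client-side commits reflect what we actually do.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexExample {
    public static void main(String[] args) throws Exception {
        // ZooKeeper host and collection name are placeholders.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk-host:2181"), Optional.empty()).build()) {
            client.setDefaultCollection("products");

            // Batches of 1000-10000 documents, sent as fast as SOLR accepts them.
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 5000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);             // routed by compositeId on the id
                doc.addField("title", "placeholder title"); // ~20 fields in reality, mostly stored+indexed
                batch.add(doc);
            }
            client.add(batch);  // note: no commit() call

            // Deletes are always by id, never delete-by-query.
            client.deleteById(Collections.singletonList("doc-42"));

            // Visibility is left entirely to the server side:
            // autoCommit maxTime = 1 min (openSearcher=false),
            // autoSoftCommit maxTime = 30 min (openSearcher=true),
            // and explicit commits from clients are ignored.
        }
    }
}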

Target query rate is up to 500 qps, maybe 300, and we need to keep response time at <200ms. But at the moment we only see very good search performance with up to 100 requests per second; whenever it grows to about 200, average response time abruptly increases to 0.5-1 second. (Also, it seems that the request rate reported by SOLR in the admin metrics is 2x higher than the real one, because for every query each shard receives 2 requests: one to obtain IDs and a second one to get the data by IDs; so the target rate in SOLR metrics would be 1000 qps.)

During high request load, CPU usage increases dramatically on the SOLR nodes. It doesn't reach 100%, but averages 50-70% on 3 servers and about 93% on 1 server (a random server each time, not the smallest one).

The documentation mentions replication as a way to spread the load between servers, so we tested replicating to smaller servers (32GB RAM, Intel Core i7-4770). However, the replicas were going out of sync all the time (possibly during commits) and reported errors like "PeerSync Recovery was not successful - trying replication." They then proceeded with full replication, which takes hours, and the leader handles all requests singlehandedly during that time. Both leaders and replicas also started encountering OOM errors (heap space) for an unknown reason; heap dump analysis shows that most of the memory is consumed by the [J (long array) type, and my best guess is that it is the "_version_" field, but it's still unclear why this happens.

Also, even though replication halves the per-node request rate and CPU usage, it doesn't seem to affect the mean_ms, stddev_ms or p95_ms numbers (p75_ms is much lower on nodes with replication, but still not as low as under a load of <100 requests/s).

Garbage collection is also much more active during high load; full GCs happen almost exclusively during those times. We have tried tuning the GC options as suggested here <https://wiki.apache.org/solr/ShawnHeisey#CMS_.28ConcurrentMarkSweep.29_Collector>, but it didn't change anything.

My questions are:

  * How do we increase throughput? Is replication the only solution?
  * If yes, then why doesn't it affect response times, considering that
    the CPU is not 100% used and the index fits into memory?
  * How to deal with OOM and replicas going into recovery?
  * Is memory or CPU the main problem? (When searching on the internet,
    I never see CPU named as the main bottleneck for SOLR, but our case
    might be different)
  * Or do we need smaller shards? Could segment merging be a problem?
  * How to add faceting without search queries slowing down too much?
  * How to diagnose these problems and narrow them down to the real
    cause in hardware or setup?

Any help would be much appreciated.

Thanks!

--
Sofiia Strochyk
s...@interlogic.com.ua <mailto:s...@interlogic.com.ua>
InterLogic
www.interlogic.com.ua <https://www.interlogic.com.ua>
Facebook: https://www.facebook.com/InterLogicOfficial | LinkedIn: https://www.linkedin.com/company/interlogic


