I don't know of any tools to inspect the cache. Under the covers,
these are things like Java's ConcurrentHashMap which don't, for
instance, carry along information like last access time IIUC.

I usually have to comb through the Solr logs and eyeball the fq clauses to
see if anything jumps out. If you do find any such patterns, you can
always add {!cache=false} to those clauses so they don't use up cache
entries, as in the example below.
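
For example (re-using the date_field example from the discussion quoted
below; the field name is just illustrative), a clause like

    fq={!cache=false}date_field:[* TO NOW]

is still applied as a filter but never stored in the filterCache.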

Best,
Erick

On Wed, Jan 16, 2019 at 7:53 AM Gael Jourdan-Weil
<gael.jourdan-w...@kelkoogroup.com> wrote:
>
> Ok, I get your point.
>
>
> Do you know if there is a tool to easily view filterCache content?
>
> I know we can see the top entries in the API or the UI but could we see more?
>
>
> Regards,
>
> Gaël
>
> ________________________________
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Tuesday, January 15, 2019 19:46:19
> To: solr-user
> Subject: Re: Re: Delayed/waiting requests
>
> bq. If I get your point, having a big cache might cause more troubles
> than help if the cache hit ratio is not high enough because the cache
> is constantly evicting/inserting entries?
>
> Pretty much. Although there are nuances.
>
> Right now, you have a 12K autowarm count. That means your cache will
> eventually always contain 12K entries whether or not you ever use the
> last 11K! I'm simplifying a bit, but it grows like this.
>
> Let's say I start Solr. Initially it has no cache entries. Now I start
> both querying and indexing. For simplicity, say I have 100 _new_ fq
> clauses come in between each commit. The first commit will autowarm
> 100, the next will autowarm 200, then 300, etc. Eventually this
> will grow to 12K. So your performance will start to vary depending on
> how long Solr has been running.
>
> Worse, it's not clear that you _ever_ re-use those clauses. One example:
> fq=date_field:[* TO NOW]
> NOW is really a Unix timestamp. So issuing the same fq 1 millisecond
> from the first one will not re-use the entry. In the worst case almost
> all of your autowarming is useless. It neither loads relevant index
> data into RAM nor is reusable.
>
> Even if you use "date math" to round to, say, a minute, if you run
> Solr long enough you'll still fill up with useless fq clauses.
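>
> For example, the same (illustrative) clause rounded with date math,
>
>     fq=date_field:[* TO NOW/MINUTE]
>
> only changes once per minute, so it can be re-used within that minute, but
> over a long enough uptime you still accumulate one new entry per minute.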
>
> Best,
> Erick
>
> On Tue, Jan 15, 2019 at 9:33 AM Gael Jourdan-Weil
> <gael.jourdan-w...@kelkoogroup.com> wrote:
> >
> > @Erick:
> >
> >
> > We will try to lower the autowarm and run some tests to compare.
> >
> > If I get your point, having a big cache might cause more trouble than it
> > helps if the cache hit ratio is not high enough, because the cache is
> > constantly evicting/inserting entries?
> >
> >
> >
> > @Jeremy:
> >
> >
> > Index size: ~20G and ~14M documents
> >
> > Server memory available: 256G, of which ~30G is used and ~100G is system cache
> >
> > Server CPU count: 32, ~10% usage
> >
> > JVM memory settings: -Xms12G -Xmx12G
> >
> >
> > We have 3 servers and 3 clusters of 3 Solr instances.
> >
> > That is, each server hosts 1 Solr instance for each cluster.
> >
> > And, indeed, each cluster only has 1 shard with replication factor 3.
> >
> >
> > Among all these Solr instances, the pauses are observed on only one
> > cluster, but on every server, at different times (sometimes on all servers
> > at the same time, but I would say that's very rare).
> >
> > We do observe the traffic is evenly balanced across the 3 servers, around 
> > 30-40 queries per second sent to each server.
> >
> >
> >
> > Regards,
> >
> > Gaël
> >
> >
> > ________________________________
> > From: Branham, Jeremy (Experis) <jb...@allstate.com>
> > Sent: Tuesday, January 15, 2019 17:59:56
> > To: solr-user@lucene.apache.org
> > Subject: Re: Re: Delayed/waiting requests
> >
> > Hi Gael –
> >
> > Could you share this information?
> > Size of the index
> > Server memory available
> > Server CPU count
> > JVM memory settings
> >
> > You mentioned a cloud configuration of 3 replicas.
> > Does that mean you have 1 shard with a replication factor of 3?
> > Do the pauses occur on all 3 servers?
> > Is the traffic evenly balanced across those servers?
> >
> >
> > Jeremy Branham
> > jb...@allstate.com
> >
> >
> > On 1/15/19, 9:50 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:
> >
> >     Well, it was a nice theory anyway.
> >
> >     "Other collections with the same settings"
> >     doesn't really mean much unless those other collections are very similar,
> >     especially in terms of numbers of docs.
> >
> >     You should only see a new searcher opening when you do a
> >     hard-commit-with-opensearcher-true or soft commit.
> >
> >     So what happens when you just try lowering the autowarm
> >     count? I'm assuming you're free to test in some non-prod
> >     system.
> >
> >     Focusing on the hit ratio is something of a red herring. Remember
> >     that each entry in your filterCache is roughly maxDoc/8 bytes plus a
> >     little overhead; the increase in GC pressure has to be balanced
> >     against the benefit of getting hits from the cache.
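> >
> >     To put rough numbers on it (assuming the ~14M documents mentioned
> >     elsewhere in this thread, and ignoring that sparse entries are stored
> >     more compactly): 14,000,000 / 8 ≈ 1.75MB per entry, so a cache that has
> >     grown to 12K full entries could, in the worst case, approach ~21GB of heap.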
> >
> >     Now, all that said, if there's no correlation, then you need to put
> >     a profiler on the system when you see this kind of thing and
> >     find out where the hotspots are, otherwise it's guesswork and
> >     I'm out of ideas.
> >
> >     Best,
> >     Erick
> >
> >     On Tue, Jan 15, 2019 at 12:06 AM Gael Jourdan-Weil
> >     <gael.jourdan-w...@kelkoogroup.com> wrote:
> >     >
> >     > Hi Erick,
> >     >
> >     >
> >     > Thank you for your detailed answer, I better understand autowarming.
> >     >
> >     >
> >     > We have an autowarming time of ~10s for the filterCache (the
> >     > queryResultCache is not used at all, ratio = 0.02).
> >     >
> >     > We increased the size of the filterCache from 6k to 12k (with the
> >     > autowarm count set to the same value) to get a better hit ratio, which
> >     > is _only_ around 0.85/0.90.
> >     >
> >     >
> >     > The thing I don't understand is that I should see "Opening new searcher"
> >     > in the logs every time a new searcher is opened, and thus every time an
> >     > autowarming happens, right?
> >     >
> >     > But I don't see "Opening new searcher" very often, and I don't see it
> >     > correlated with the response time peaks.
> >     >
> >     >
> >     > Also, I didn't mention it earlier, but we have other SolrCloud clusters
> >     > with similar settings and load (~10s filterCache autowarming, 10k entries)
> >     > and we don't observe the same behavior there.
> >     >
> >     >
> >     > Regards,
> >     >
> >     > ________________________________
> >     > From: Erick Erickson <erickerick...@gmail.com>
> >     > Sent: Monday, January 14, 2019 17:44:38
> >     > To: solr-user
> >     > Subject: Re: Delayed/waiting requests
> >     >
> >     > Gael:
> >     >
> >     > bq. Nevertheless, our filterCache is set to autowarm 12k entries which
> >     > is also the maxSize
> >     >
> >     > That is far, far, far too many. Let's assume you actually have 12K
> >     > entries in the filterCache.
> >     > Every time you open a new searcher, 12K queries are executed _before_
> >     > the searcher
> >     > accepts any new requests. While being able to re-use a filterCache
> >     > entry is useful, one of
> >     > the primary purposes is to pre-load index data from disk into memory
> >     > which can be
> >     > the event that takes the most time.
> >     >
> >     > The queryResultCache has a similar function. I often find that this
> >     > cache doesn't have a
> >     > very high hit ratio, but again executing a _few_ of these queries
> >     > warms the index from
> >     > disk.
> >     >
> >     > I think of both caches as a map, where the key is the "thing" (the fq
> >     > clause in the case of the filterCache, the whole query in the case of
> >     > the queryResultCache).
> >     > Autowarming
> >     > replays the most recently executed N of these entries, essentially
> >     > just as though
> >     > they were submitted by a user.
> >     >
> >     > Hypothesis: You're massively over-warming, and when that kicks in you
> >     > see increased CPU and GC pressure, leading to the anomalies you're
> >     > observing. Further, you have so much autowarming going on that it's
> >     > hard to see the associated messages in the log.
> >     >
> >     > Here's what I'd recommend: Set your autowarm counts to something on
> >     > the order of 16. If the culprit is just excessive autowarming, I'd
> >     > expect your spikes to be much less severe. It _might_ be that your
> >     > users see some increased (very temporary) variance in response time.
> >     > You can tell that autowarming configuration is "more art than science";
> >     > I can't give you any recommendation other than "start small and
> >     > increase until you're happy", unfortunately.
> >     >
> >     > I usually do this with some kind of load tester in a dev lab, of course ;).
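> >     >
> >     > As a concrete sketch of that recommendation in solrconfig.xml (the
> >     > class and sizes here are just illustrative, keep whatever you have):
> >     >
> >     >     <filterCache class="solr.FastLRUCache"
> >     >                  size="12000"
> >     >                  initialSize="512"
> >     >                  autowarmCount="16"/>
> >     >
> >     > Only autowarmCount controls how many entries are replayed against the
> >     > new searcher; size just caps how many entries the cache may hold.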
> >     >
> >     > Finally, if you use the metrics data (see:
> >     > https://lucene.apache.org/solr/guide/7_1/metrics-reporting.html)
> >     > you can see the autowarm times. Don't get too lost in the page to
> >     > start, just hit the http://localhost:8983/solr/admin/metrics endpoint
> >     > and look for "warmupTime", then refine on how to get _only_
> >     > the warmup stats ;).
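> >     >
> >     > A sketch of one such refinement (the exact metric names can vary, so
> >     > treat this as a starting point rather than gospel):
> >     >
> >     >     curl "http://localhost:8983/solr/admin/metrics?group=core&prefix=CACHE.searcher.filterCache"
> >     >
> >     > and look at the "warmupTime" values in the response.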
> >     >
> >     > Best,
> >     > Erick
> >     >
> >     > On Mon, Jan 14, 2019 at 5:08 AM Gael Jourdan-Weil
> >     > <gael.jourdan-w...@kelkoogroup.com> wrote:
> >     > >
> >     > > I had a look at the GC logs this morning but I'm not sure how to
> >     > > interpret them.
> >     > >
> >     > >
> >     > > Over a period of 54 minutes, there are:
> >     > >
> >     > > - Number of pauses: 2739
> >     > >
> >     > > - Accumulated pauses: 93s => that is 2.86% of the time
> >     > >
> >     > > - Average pause duration: 0.03s
> >     > >
> >     > > - Average pause interval: 1.18s
> >     > >
> >     > > - Accumulated full GC: 0
> >     > >
> >     > > I'm not sure if this is a lot or not. What do you think?
> >     > >
> >     > >
> >     > > Looking more closely at the GC logs with GCViewer, I notice that the
> >     > > high response time peaks happen at the same time as GC pauses that
> >     > > take 2x longer than average (around 0.06s).
> >     > >
> >     > >
> >     > > Also, we are indeed indexing at the same time, but we have
> >     > > autowarming set.
> >     > >
> >     > > I don't see any Searcher opened at the time we experience slowness.
> >     > >
> >     > > Nevertheless, our filterCache is set to autowarm 12k entries, which
> >     > > is also the maxSize.
> >     > >
> >     > > Could this have any downside?
> >     > >
> >     > >
> >     > > Thanks,
> >     > >
> >     > > Gaël
> >     > >
> >     > >
> >     > > ________________________________
> >     > > From: Erick Erickson <erickerick...@gmail.com>
> >     > > Sent: Friday, January 11, 2019 17:21
> >     > > To: solr-user
> >     > > Subject: Re: Delayed/waiting requests
> >     > >
> >     > > Jimi's comment is one of the very common culprits.
> >     > >
> >     > > Autowarming is another. Are you indexing at the same
> >     > > time? If so, it could well be that you aren't autowarming and
> >     > > the spikes are caused by using a new IndexSearcher
> >     > > that has to read much of the index off disk when commits
> >     > > happen. The "smoking gun" here would be if the spikes
> >     > > correlate to your commits (soft or hard-with-opensearcher-true).
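> >     > >
> >     > > For reference, these are the solrconfig.xml knobs involved (the
> >     > > intervals here are placeholders, not a recommendation):
> >     > >
> >     > >     <autoCommit>
> >     > >       <maxTime>60000</maxTime>
> >     > >       <openSearcher>false</openSearcher>
> >     > >     </autoCommit>
> >     > >     <autoSoftCommit>
> >     > >       <maxTime>300000</maxTime>
> >     > >     </autoSoftCommit>
> >     > >
> >     > > A new searcher (and thus autowarming) is opened on every soft commit,
> >     > > or on a hard commit with openSearcher=true.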
> >     > >
> >     > > Best,
> >     > > Erick
> >     > >
> >     > > On Fri, Jan 11, 2019 at 1:23 AM Gael Jourdan-Weil
> >     > > <gael.jourdan-w...@kelkoogroup.com> wrote:
> >     > > >
> >     > > > Interesting indeed, we did not see anything with VisualVM, but
> >     > > > having a look at the GC logs could give us more info, especially on
> >     > > > the pauses.
> >     > > >
> >     > > > I will collect data over the weekend and look at it.
> >     > > >
> >     > > >
> >     > > > Thanks
> >     > > >
> >     > > > ________________________________
> >     > > > From: Hullegård, Jimi <jimi.hulleg...@svensktnaringsliv.se>
> >     > > > Sent: Friday, January 11, 2019 03:46:02
> >     > > > To: solr-user@lucene.apache.org
> >     > > > Subject: Re: Delayed/waiting requests
> >     > > >
> >     > > > Could be caused by garbage collection in the JVM.
> >     > > >
> >     > > > https://wiki.apache.org/solr/SolrPerformanceProblems
> >     > > >
> >     > > > Go down to the section called “GC pause problems”
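> >     > > >
> >     > > > If GC logging isn't already enabled by your Solr start script, flags
> >     > > > along these lines (Java 8 style, log path illustrative) will capture
> >     > > > the pauses:
> >     > > >
> >     > > >     -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> >     > > >     -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log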
> >     > > >
> >     > > > /Jimi
> >     > > >
> >     > > > Sent from my iPhone
> >     > > >
> >     > > > On 11 Jan 2019, at 05:05, Gael Jourdan-Weil
> >     > > > <gael.jourdan-w...@kelkoogroup.com> wrote:
> >     > > >
> >     > > > Hello,
> >     > > >
> >     > > > We are experiencing some performance issues on a simple SolrCloud
> >     > > > cluster of 3 replicas (1 core), but what we found during our analysis
> >     > > > seems a bit odd, so we thought the community could have relevant
> >     > > > ideas on this.
> >     > > >
> >     > > > Load: between 30 and 40 queries per second, constant over the period
> >     > > > of analysis.
> >     > > >
> >     > > > Symptoms: high response times over short periods of time, but quite
> >     > > > frequently.
> >     > > > We are talking about request response times going from 50ms to 5000ms
> >     > > > or even worse for less than 5 seconds, and then going back to normal.
> >     > > >
> >     > > > What we found out: just before the response time increases, requests
> >     > > > seem to be delayed.
> >     > > > That is, for 2-3 seconds requests pile up and no response is sent,
> >     > > > and then all requests are resolved and the responses are all returned
> >     > > > to the clients at the same time.
> >     > > > Very much as if there was a lock happening somewhere. But we found no
> >     > > > "lock" time at either the JVM or system level.
> >     > > >
> >     > > > Can someone think of something in the way Solr works that could
> >     > > > explain this?
> >     > > > Or ideas to track down the root cause?
> >     > > >
> >     > > > Solr version is 7.2.1.
> >     > > >
> >     > > > Thanks for reading,
> >     > > >
> >     > > > Gaël Jourdan-Weil
> >     > > >
> >
> >
