Is this collection anyway drastically different than others in terms of schema/# of fields/total document etc is it sharded and if so can you look which shard taking more time with shard.info=true.
Thnx Susheel On Wed, Jun 13, 2018 at 2:29 PM, Chris Troullis <cptroul...@gmail.com> wrote: > Thanks Erick, > > Seems to be a mixed bag in terms of tlog size across all of our indexes, > but currently the index with the performance issues has 4 tlog files > totally ~200 MB. This still seems high to me since the collections are in > sync, and we hard commit every minute, but it's less than the ~8GB it was > before we cleaned them up. Spot checking some other indexes show some have > tlogs >3GB, but none of those indexes are having performance issues (on the > same solr node), so I'm not sure it's related. We have 13 collections of > various sizes running on our solr cloud cluster, and none of them seem to > have this issue except for this one index, which is not our largest index > in terms of size on disk or number of documents. > > As far as the response intervals, just running a default search *:* sorting > on our id field so that we get consistent results across environments, and > returning 200 results (our max page size in app) with ~20 fields, we see > times of ~3.5 seconds in production, compared to ~1 second on one of our > lower environments with an exact copy of the index. Both have CDCR enabled > and have identical clusters. > > Unfortunately, currently the only instance we are seeing the issue on is > production, so we are limited in the tests that we can run. I did confirm > in the lower environment that the doc cache is large enough to hold all of > the results, and that both the doc and query caches should be serving the > results. Obviously production we have much more indexing going on, but we > do utilize autowarming for our caches so our response times are still > stable across new searchers. > > We did move the lower environment to the same ESX host as our production > cluster, so that it is getting resources from the same pool (CPU, RAM, > etc). The only thing that is different is the disks, but the lower > environment is running on slower disks than production. And if it was a > disk issue you would think it would be affecting all of the collections, > not just this one. > > It's a mystery! > > Chris > > > > On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson <erickerick...@gmail.com> > wrote: > > > First, nice job of eliminating all the standard stuff! > > > > About tlogs: Sanity check: They aren't growing again, right? They > > should hit a relatively steady state. The tlogs are used as a queueing > > mechanism for CDCR to durably store updates until they can > > successfully be transmitted to the target. So I'd expect them to hit a > > fairly steady number. > > > > Your lack of CPU/IO spikes is also indicative of something weird, > > somehow Solr just sitting around doing nothing. What intervals are we > > talking about here for response? 100ms? 5000ms? > > > > When you hammer the same query over and over, you should see your > > queryResultCache hits increase. If that's the case, Solr is doing no > > work at all for the search, just assembling the resopnse packet which, > > as you say, should be in the documentCache. This assumes it's big > > enough to hold all of the docs that are requested by all the > > simultaneous requests. The queryResultCache cache will be flushed > > every time a new searcher is opened. So if you still get your poor > > response times, and your queryResultCache hits are increasing then > > Solr is doing pretty much nothing. > > > > So does this behavior still occur if you aren't adding docs to the > > index? If you turn indexing off as a test, that'd be another data > > point. > > > > And, of course, if it's at all possible to just take the CDCR > > configuration out of your solrconfig file temporarily that'd nail > > whether CDCR is the culprit or whether it's coincidental. You say that > > CDCR is the only difference between the environments, but I've > > certainly seen situations where it turns out to be a bad disk > > controller or something that's _also_ different. > > > > Now, assuming all that's inconclusive, I'm afraid the next step would > > be to throw a profiler at it. Maybe pull a stack traces. > > > > Best, > > Erick > > > > On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis <cptroul...@gmail.com> > > wrote: > > > Thanks Erick. A little more info: > > > > > > -We do have buffering disabled everywhere, as I had read multiple posts > > on > > > the mailing list regarding the issue you described. > > > -We soft commit (with opensearcher=true) pretty frequently (15 seconds) > > as > > > we have some NRT requirements. We hard commit every 60 seconds. We > never > > > commit manually, only via the autocommit timers. We have been using > these > > > settings for a long time and have never had any issues until recently. > > And > > > all of our other indexes are fine (some larger than this one). > > > -We do have documentResultCache enabled, although it's not very big. > But > > I > > > can literally spam the same query over and over again with no other > > queries > > > hitting the box, so all the results should be cached. > > > -We don't see any CPU/IO spikes when running these queries, our load is > > > pretty much flat on all accounts. > > > > > > I know it seems odd that CDCR would be the culprit, but it's really the > > > only thing we've changed, and we have other environments running the > > exact > > > same setup with no issues, so it is really making us tear our hair out. > > And > > > when we cleaned up the huge tlogs it didn't seem to make any difference > > in > > > the query time (I was originally thinking it was somehow searching > > through > > > the tlogs for documents, and that's why it was taking so long to > retrieve > > > the results, but I don't know if that is actually how it works). > > > > > > Are you aware of any logger settings we could increase to potentially > > get a > > > better idea of where the time is being spent? I took the eventual query > > > response and just hosted as a static file on the same machine via nginx > > and > > > it downloaded lightning fast (I was trying to rule out network as the > > > culprit), so it seems like the time is being spent somewhere in solr. > > > > > > Thanks, > > > Chris > > > > > > On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson < > erickerick...@gmail.com > > > > > > wrote: > > > > > >> Having the tlogs be huge is a red flag. Do you have buffering enabled > > >> in CDCR? This was something of a legacy option that's going to be > > >> removed, it's been made obsolete by the ability of CDCR to bootstrap > > >> the entire index. Buffering should be disabled always. > > >> > > >> Another reason tlogs can grow is if you have very long times between > > >> hard commits. I doubt that's your issue, but just in case. > > >> > > >> And the final reason tlogs can grow is that the connection between > > >> source and target clusters is broken, but that doesn't sound like what > > >> you're seeing either since you say the target cluster is keeping up. > > >> > > >> The process of assembling the response can be long. If you have any > > >> stored fields (and not docValues-enabled), Solr will > > >> 1> seek the stored data on disk > > >> 2> decompress (min 16K blocks) > > >> 3> transmit the thing back to your client > > >> > > >> The decompressed version of the doc will be held in the > > >> documentResultCache configured in solrconfig.xml, so it may or may not > > >> be cached in memory. That said, this stuff is all MemMapped and the > > >> decompression isn't usually an issue, I'd expect you to see very large > > >> CPU spikes and/or I/O contention if that was the case. > > >> > > >> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will > > >> have to look in the tlogs to get you the very most recent copy, so the > > >> first place I'd look is keeping the tlogs under control first. > > >> > > >> The other possibility (again unrelated to CDCR) is if your spikes are > > >> coincident with soft commits or hard-commits-with-opensearcher-true. > > >> > > >> In all, though, none of the usual suspects seems to make sense here > > >> since you say that absent configuring CDCR things seem to run fine. So > > >> I'd look at the tlogs and my commit intervals. Once the tlogs are > > >> under control then move on to other possibilities if the problem > > >> persists... > > >> > > >> Best, > > >> Erick > > >> > > >> > > >> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis < > cptroul...@gmail.com> > > >> wrote: > > >> > Hi all, > > >> > > > >> > Recently we have gone live using CDCR on our 2 node solr cloud > cluster > > >> > (7.2.1). From a CDCR perspective, everything seems to be working > > >> > fine...collections are staying in sync across the cluster, > everything > > >> looks > > >> > good. > > >> > > > >> > The issue we are seeing is with 1 collection in particular, after we > > set > > >> up > > >> > CDCR, we are getting extremely slow response times when retrieving > > >> > documents. Debugging the query shows QTime is almost nothing, but > the > > >> > overall responseTime is like 5x what it should be. The problem is > > >> > exacerbated by larger result sizes. IE retrieving 25 results is > almost > > >> > normal, but 200 results is way slower than normal. I can run the > exact > > >> same > > >> > query multiple times in a row (so everything should be cached), and > I > > >> still > > >> > see response times way higher than another environment that is not > > using > > >> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just > > that > > >> > we are using the CDCRUpdateLog. The problem started happening even > > before > > >> > we enabled CDCR. > > >> > > > >> > In a lower environment we noticed that the transaction logs were > huge > > >> > (multiple gigs), so we tried stopping solr and deleting the tlogs > then > > >> > restarting, and that seemed to fix the performance issue. We tried > the > > >> same > > >> > thing in production the other day but it had no effect, so now I > don't > > >> know > > >> > if it was a coincidence or not. > > >> > > > >> > Things that we have tried: > > >> > > > >> > -Completely deleting the collection and rebuilding from scratch > > >> > -Running the query directly from solr admin to eliminate other > causes > > >> > -Doing a tcpdump on the solr node to eliminate a network issue > > >> > > > >> > None of these things have yielded any results. It seems very > > >> inconsistent. > > >> > Some environments we can reproduce it in, others we can't. > > >> > Hardware/configuration/network is exactly the same between all > > >> > envrionments. The only thing that we have narrowed it down to is we > > are > > >> > pretty sure it has something to do with CDCR, as the issue only > > started > > >> > when we started using it. > > >> > > > >> > I'm wondering if any of this sparks any ideas from anyone, or if > > people > > >> > have suggestions as to how I can figure out what is causing this > long > > >> query > > >> > response time? The debug flag on the query seems more geared towards > > >> seeing > > >> > where time is spent in the actual query, which is nothing in my > case. > > The > > >> > time is spent retrieving the results, which I don't have much > > information > > >> > on. I have tried increasing the log level but nothing jumps out at > me > > in > > >> > the solr logs. Is there something I can look for specifically to > help > > >> debug > > >> > this? > > >> > > > >> > Thanks, > > >> > > > >> > Chris > > >> > > >