Just as an update for the list: we rebuilt the index in question from scratch for a second time this weekend, and the problem went away on one node but was still present on the other. After restarting the problematic node, the problem went away there as well. It still makes me a little uneasy that we weren't able to determine the cause, but at least we are back to normal query times now.
Chris

On Fri, Jun 15, 2018 at 8:06 AM, Chris Troullis <cptroul...@gmail.com> wrote:

> Thanks Shawn,
>
> As mentioned previously, we are hard committing every 60 seconds, which we
> have been doing for years, and we had no issues until enabling CDCR. We had
> never seen large tlog sizes before, and even manually issuing a hard commit
> to the collection does not reduce the size of the tlogs. I believe this is
> because when using the CDCRUpdateLog, the tlogs are not purged until the
> docs have been replicated over. Anyway, since we manually purged the tlogs
> they seem to be staying at an acceptable size now, so I don't think that is
> the cause. The documents are not abnormally large: maybe ~20 string/numeric
> fields with simple whitespace tokenization.
>
> To answer your questions:
>
> - Solr version: 7.2.1
> - OS vendor and version Solr is running on: CentOS 6
> - Total document count on the server (counting all index cores): 13
>   collections totaling ~60 million docs
> - Total index size on the server (counting all cores): ~60 GB
> - Total of all Solr heaps on the server: 16 GB (we had to increase it for
>   CDCR because it was using a lot more heap)
> - Software other than Solr on the server: none
> - Total memory installed on the server: 64 GB
>
> All of this has been consistent for multiple years across multiple Solr
> versions, and we only started seeing this issue once we started using the
> CDCRUpdateLog and CDCR, which is why that is the only real thing we can
> point to. And again, the issue is only affecting 1 of the 13 collections on
> the server, so if it were hardware/heap/GC related I would expect to see it
> for every collection, not just one, as they all share the same resources.
>
> I will take a look at the GC logs, but I don't think that is the cause.
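The behavior Chris describes matches how the CDCR-aware update log works: it retains transaction-log entries until the target cluster has consumed them, so an ordinary hard commit will not shrink the tlogs. As a sketch for readers setting this up, the source cluster's update log is switched in solrconfig.xml roughly like this (the `dir` property shown is the stock default, not a value taken from this thread):

```xml
<!-- solrconfig.xml (source cluster): CDCR requires the CDCR-aware update
     log in place of the default one. Tlog entries managed by this class are
     kept until the peer cluster has replicated them, so the files can grow
     far larger than with a plain updateLog. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>
```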
> The consistent nature of the slow performance doesn't really point to GC
> issues, and we have profiling set up in New Relic and it does not show any
> long/frequent GC pauses.
>
> We are going to try to rebuild the collection from scratch again this
> weekend, as that has solved the issue in some lower environments, although
> it's not really consistent. At this point it's all we can think of to do.
>
> Thanks,
>
> Chris
>
> On Thu, Jun 14, 2018 at 6:23 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 6/12/2018 12:06 PM, Chris Troullis wrote:
>> > The issue we are seeing is with 1 collection in particular: after we
>> > set up CDCR, we are getting extremely slow response times when
>> > retrieving documents. Debugging the query shows QTime is almost
>> > nothing, but the overall responseTime is like 5x what it should be.
>> > The problem is exacerbated by larger result sizes, i.e. retrieving 25
>> > results is almost normal, but 200 results is way slower than normal. I
>> > can run the exact same query multiple times in a row (so everything
>> > should be cached), and I still see response times way higher than
>> > another environment that is not using CDCR. It doesn't seem to matter
>> > whether CDCR is enabled or disabled, just that we are using the
>> > CDCRUpdateLog. The problem started happening even before we enabled
>> > CDCR.
>> >
>> > In a lower environment we noticed that the transaction logs were huge
>> > (multiple gigs), so we tried stopping Solr and deleting the tlogs,
>> > then restarting, and that seemed to fix the performance issue. We
>> > tried the same thing in production the other day but it had no effect,
>> > so now I don't know if it was a coincidence or not.
>>
>> There is one other cause besides CDCR buffering that I know of for huge
>> transaction logs, and it has nothing to do with CDCR: a lack of hard
>> commits.
>> It is strongly recommended to have autoCommit set to a reasonably short
>> interval (about a minute in my opinion, but 15 seconds is VERY common).
>> Most of the time, openSearcher should be set to false in the autoCommit
>> config, and other mechanisms (which might include autoSoftCommit) should
>> be used for change visibility. The example autoCommit settings might seem
>> superfluous because they don't affect what's searchable, but they are
>> actually a very important configuration to keep.
>>
>> Are the docs in this collection really big, by chance?
>>
>> As I went through previous threads you've started on the mailing list, I
>> noticed that none of your messages provided some details that would be
>> useful for looking into performance problems:
>>
>> * What OS vendor and version Solr is running on.
>> * Total document count on the server (counting all index cores).
>> * Total index size on the server (counting all cores).
>> * What the total of all Solr heaps on the server is.
>> * Whether there is software other than Solr on the server.
>> * How much total memory the server has installed.
>>
>> If you name the OS, I can use that information to help you gather some
>> additional info which will actually show me most of that list. Total
>> document count is something that I cannot get from the info I would help
>> you gather.
>>
>> Something else that can cause performance issues is GC pauses. If you
>> provide a GC log (the script that starts Solr logs this by default), we
>> can analyze it to see if that's a problem.
>>
>> Attachments to messages on the mailing list typically do not make it to
>> the list, so a file-sharing website is a better way to share large log
>> files. A paste website is good for smaller log data.
>>
>> Thanks,
>> Shawn
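Shawn's autoCommit advice above can be sketched in solrconfig.xml roughly as follows. The 60-second hard-commit interval matches what Chris reports using; the soft-commit interval is purely illustrative and not taken from this thread:

```xml
<!-- solrconfig.xml: hard commit on a short interval so segments are flushed
     and old tlogs can be rolled over, without opening a new searcher. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit at most every 60 s -->
    <openSearcher>false</openSearcher>  <!-- durability only, no visibility -->
  </autoCommit>
  <!-- Change visibility is handled separately, e.g. by soft commits. -->
  <autoSoftCommit>
    <maxTime>120000</maxTime>           <!-- illustrative value -->
  </autoSoftCommit>
</updateHandler>
```

A one-off hard commit can also be issued against a collection through the update endpoint, e.g. `curl 'http://host:8983/solr/<collection>/update?commit=true&openSearcher=false'` (host and collection name are placeholders) — though, as noted earlier in the thread, this alone will not purge tlogs that the CDCRUpdateLog is holding back for replication.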