FYI to all, as an update: we rebuilt the index in question from scratch for
a second time this weekend, and the problem went away on one node but
persisted on the other. After restarting the problematic node, the problem
went away there as well. It still makes me a little uneasy that we weren't
able to determine the cause, but at least we are back to normal query times
now.

Chris

On Fri, Jun 15, 2018 at 8:06 AM, Chris Troullis <cptroul...@gmail.com>
wrote:

> Thanks Shawn,
>
> As mentioned previously, we are hard committing every 60 seconds, which we
> have been doing for years with no issues until enabling CDCR. We have
> never seen large tlogs before, and even manually issuing a hard commit to
> the collection does not reduce their size. I believe this is because, when
> using the CDCRUpdateLog, the tlogs are not purged until the docs have been
> replicated over. In any case, since we manually purged the tlogs they now
> seem to be staying at an acceptable size, so I don't think that is the
> cause. The documents are not abnormally large, maybe ~20 string/numeric
> fields with simple whitespace tokenization.
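>
> For context, the CDCR-aware update log is declared in solrconfig.xml along
> these lines (a minimal sketch based on the Solr 7.x CDCR documentation,
> not our exact config):
>
>     <updateHandler class="solr.DirectUpdateHandler2">
>       <!-- CdcrUpdateLog retains tlog entries until the target data center
>            has received them, rather than freeing them after a hard commit -->
>       <updateLog class="solr.CdcrUpdateLog">
>         <str name="dir">${solr.ulog.dir:}</str>
>       </updateLog>
>     </updateHandler>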
>
> To answer your questions:
>
> -Solr version: 7.2.1
> -OS vendor and version: CentOS 6
> -Total document count on the server (all index cores): 13 collections
> totaling ~60 million docs
> -Total index size on the server (all cores): ~60 GB
> -Total of all Solr heaps on the server: 16 GB (we had to increase it for
> CDCR because CDCR was using a lot more heap)
> -Software other than Solr on the server: none
> -Total memory installed on the server: 64 GB
>
> All of this has been consistent for multiple years across multiple Solr
> versions, and we only started seeing this issue once we started using the
> CDCRUpdateLog and CDCR, which is why that is the only real thing we can
> point to. And again, the issue only affects one of the 13 collections on
> the server, so if it were hardware/heap/GC related I would expect to see
> it in every collection, not just one, as they all share the same
> resources.
>
> I will take a look at the GC logs, but I don't think that is the cause.
> The consistently slow performance doesn't really point to GC issues, and
> the profiling we have set up in New Relic does not show any long or
> frequent GC pauses.
>
> We are going to try rebuilding the collection from scratch again this
> weekend, as that has solved the issue in some lower environments, although
> not consistently. At this point it's all we can think of to do.
>
> Thanks,
>
> Chris
>
>
> On Thu, Jun 14, 2018 at 6:23 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 6/12/2018 12:06 PM, Chris Troullis wrote:
>> > The issue we are seeing is with one collection in particular: after we
>> > set up CDCR, we are getting extremely slow response times when
>> > retrieving documents. Debugging the query shows QTime is almost
>> > nothing, but the overall response time is about 5x what it should be.
>> > The problem is exacerbated by larger result sizes, i.e. retrieving 25
>> > results is almost normal, but 200 results is way slower than normal. I
>> > can run the exact same query multiple times in a row (so everything
>> > should be cached), and I still see response times way higher than in
>> > another environment that is not using CDCR. It doesn't seem to matter
>> > whether CDCR is enabled or disabled, just that we are using the
>> > CDCRUpdateLog. The problem started happening even before we enabled
>> > CDCR.
>> >
>> > In a lower environment we noticed that the transaction logs were huge
>> > (multiple gigs), so we tried stopping Solr, deleting the tlogs, and
>> > restarting, and that seemed to fix the performance issue. We tried the
>> > same thing in production the other day, but it had no effect, so now I
>> > don't know if it was a coincidence or not.
>>
>> There is one other cause besides CDCR buffering that I know of for huge
>> transaction logs, and it has nothing to do with CDCR:  A lack of hard
>> commits.  It is strongly recommended to have autoCommit set to a
>> reasonably short interval (about a minute in my opinion, but 15 seconds
>> is VERY common).  Most of the time openSearcher should be set to false
>> in the autoCommit config, and other mechanisms (which might include
>> autoSoftCommit) should be used for change visibility.  The example
>> autoCommit settings might seem superfluous because they don't affect
>> what's searchable, but it is actually a very important configuration to
>> keep.
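>>
>> For illustration, a typical solrconfig.xml arrangement looks something
>> like this (the intervals are just example values, not a prescription for
>> your setup):
>>
>>     <autoCommit>
>>       <!-- hard commit: flushes index changes to disk and rolls over the
>>            transaction log, but does not open a new searcher -->
>>       <maxTime>60000</maxTime>
>>       <openSearcher>false</openSearcher>
>>     </autoCommit>
>>     <autoSoftCommit>
>>       <!-- soft commit: controls when new documents become searchable -->
>>       <maxTime>120000</maxTime>
>>     </autoSoftCommit>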
>>
>> Are the docs in this collection really big, by chance?
>>
>> Going through previous threads you've started on the mailing list, I
>> noticed that none of your messages provided some details that would be
>> useful for looking into performance problems:
>>
>>  * What OS vendor and version Solr is running on.
>>  * Total document count on the server (counting all index cores).
>>  * Total index size on the server (counting all cores).
>>  * What the total of all Solr heaps on the server is.
>>  * Whether there is software other than Solr on the server.
>>  * How much total memory the server has installed.
>>
>> If you name the OS, I can use that information to help you gather some
>> additional info that will show me most of that list. Total document
>> count is something I cannot get from the info I would help you gather.
>>
>> Something else that can cause performance issues is GC pauses. If you
>> provide a GC log (the script that starts Solr produces one by default),
>> we can analyze it to see whether that's a problem.
>>
>> Attachments to messages on the mailing list typically do not make it to
>> the list, so a file sharing website is a better way to share large
>> logfiles.  A paste website is good for log data that's smaller.
>>
>> Thanks,
>> Shawn
>>
>>
>
