Hi all

We run Solr 8.2.0
* with the Amazon Corretto 11.0.5.10.1 JDK (JVM arguments shown in [1]),
* on Ubuntu 18.04,
* on an AWS EC2 m5.2xlarge with 8 vCPUs and 32 GB of RAM,
* with -Xmx16g [1].

We have migrated from Solr 3.5, and on replicas hosting a big core (16 GB) we
have started to suffer degraded service. The replica's ReplicationHandler
configuration is in [8] and the master's updateHandler configuration in [9].

Every 5 minutes (the value of solr.autoCommit.maxTime) we notice the
following:
* Solr normally uses all 8 CPUs. Then, suddenly, for ~30 seconds it uses only
1 CPU at 100% while the other 7 are idle (mpstat [6]). In our previous Solr 3
setup we used up to 80% of all CPUs.
* During that window, Solr queries take more than 1 second, and up to 30
seconds or more, to respond; the same queries otherwise complete in less than
1 second (see the timing sketch after this list).
* The disk does not seem to be a bottleneck (iostat [4]).
* Memory does not seem to be a bottleneck (vmstat [5]).
* CPU (apart from the single CPU issue) does not seem to be a bottleneck
(mpstat [6] & pidstat [3]).
* We are no Java/GC experts, but it does not seem to be GC related [7].
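
For reference, latency samples like those in [2] can be collected with a small
timing loop against the replica, along the lines of the sketch below (the
host, core name, query and interval are placeholders, not our real ones):

# Sketch: time a lightweight query against one replica once per second
# and print the wall-clock latency, to spot the ~30 second slow windows.
import time
import urllib.request

# Placeholder host/core/query; rows=0 keeps the response small.
SOLR_URL = "http://localhost:8983/solr/mycore/select?q=*:*&rows=0"

while True:
    start = time.monotonic()
    with urllib.request.urlopen(SOLR_URL, timeout=60) as resp:
        resp.read()
    elapsed = time.monotonic() - start
    print(f"{time.strftime('%H:%M:%S')} query_wall_time={elapsed:.3f}s")
    time.sleep(1)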

We have tried reducing the heap to 8 GB and to 2 GB with no positive effect.
We have also tested different autoCommit.maxTime values: reducing it to 60
seconds makes things unbearable, while 5 minutes is not significantly
different from 10. Do you have any pointers on how to proceed with debugging
this issue?

A detailed example of the problem, which repeats every solr.autoCommit.maxTime
interval on the replicas:
* From 12:36 to 12:39:04 queries are served fast [2]. Solr consumes CPU on
all 8 CPUs (mpstat [6]). The metric solr.jvm.threads.blocked.count is 0 [2].
* From 12:39:04 to 12:39:25 queries are slow to respond [2]. Solr consumes
only 1 of the 8 CPUs while the other 7 are idle (mpstat [6]). The metric
solr.jvm.threads.blocked.count grows from 0 to a high two-digit number [2]
(see the sampling sketch after this list).
* After 12:39:25, and until the replica polls the next commit, things are
back to normal.
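
For completeness, a counter like solr.jvm.threads.blocked.count can be sampled
from Solr's Metrics API with a loop roughly like the sketch below (the host,
poll interval and the exact JSON shape here are simplified placeholders):

# Sketch: poll the JVM registry of the Metrics API and print the
# blocked-thread gauge every few seconds.
import json
import time
import urllib.request

# Placeholder host; group=jvm with a prefix limits the response to the
# threads.blocked.count gauge of the solr.jvm registry.
METRICS_URL = ("http://localhost:8983/solr/admin/metrics"
               "?group=jvm&prefix=threads.blocked.count&wt=json")

while True:
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        data = json.load(resp)
    blocked = data.get("metrics", {}).get("solr.jvm", {}).get("threads.blocked.count")
    # Depending on the Solr version the gauge may be wrapped in {"value": N}.
    if isinstance(blocked, dict):
        blocked = blocked.get("value")
    print(f"{time.strftime('%H:%M:%S')} threads.blocked.count={blocked}")
    time.sleep(5)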

Regards
Vangelis

[1] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-solr-info
[2] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-slow-queries-and-solr-jvm-threads-blocked-count
[3] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-pidstat
[4] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-iostat
[5] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-vmstat
[6] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-mpstat
[7] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-gc-logs
[8] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-replica-replicationhandler
[9] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-master-updatehandler