
On Mon, Feb 10, 2020 at 5:05 PM Vangelis Katsikaros <vkatsika...@gmail.com> wrote:

> Hi all
> We run Solr 8.2.0
> * with Amazon Corretto SDK (java arguments shown in [1]),
> * on Ubuntu 18.04
> * on AWS EC2 m5.2xlarge with 8 CPUs and 32GB of RAM
> * with -Xmx16g [1].
> We have migrated from Solr 3.5 and in big core (16GB) replicas we have
> started to suffer degraded service. The replica’s ReplicationHandler is in
> [8] and the master’s updateHandler in [9].
> We notice every 5 mins (the value for solr.autoCommit.maxTime) the
> following:
> * Solr uses all 8 CPUs. Suddenly for ~30 sec, it uses only 1 CPU at 100%
> and the rest of the CPUs are idle (mpstat [6]). In our previous setup with
> Solr 3 we used up to 80% of all CPUs.
> * During that time the solr queries suddenly take more than 1 second, up
> to 30 sec (or more). The same queries otherwise need less than 1 sec to
> complete.
> * The disk does not seem to be a bottleneck (iostat [4]).
> * Memory does not seem to be a bottleneck (vmstat [5]).
> * CPU (apart from the single CPU issue) does not seem to be a bottleneck
> (mpstat [6] & pidstat [3]).
> * We are not Java/GC experts, but it does not seem to be GC related [7].
> We have tried reducing the heap to 8 and 2GB with no positive effect. We
> have tested different autoCommit.maxTime values. Reducing it to 60 seconds
> makes things unbearable. 5 minutes is not significantly different than 10.
> Do you have any pointers on how to proceed with debugging the issue?
> Detailed example problem that repeats every solr.autoCommit.maxTime
> minutes on the replicas:
> * From 12:36 to 12:39:04 queries are fast to serve [2]. Solr consumes CPU
> from all 8 CPUs (mpstat [6]). The metric solr.jvm.threads.blocked.count is
> 0 [2].
> * From 12:39:04 to 12:39:25 queries are slow to respond [2]. Solr consumes
> only 1 out of 8 CPUs, the other 7 CPUs are idle (mpstat [6]). The metric
> solr.jvm.threads.blocked.count grows from 0 to a big 2 digit number [2].
> * After 12:39:25 and until the next poll of a commit things are normal.
> Regards
> Vangelis
> [1]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-solr-info
> [2]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-slow-queries-and-solr-jvm-threads-blocked-count
> [3]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-pidstat
> [4]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-iostat
> [5]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-vmstat
> [6]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-mpstat
> [7]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-gc-logs
> [8]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-replica-replicationhandler
> [9]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-master-updatehandler

Some additional information: we noticed (through the admin UI's "Thread Dump",
/solr/#/~threads) that whenever we see this behavior, all the threads
that block show the same stack trace [10] and block at


The boost files (external_boostvalue) are ~30M in size, and the fields
are configured in the schema [11] with:
  <field name="boostvalue" type="fileboost"/>
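To check that all blocked threads really stop at the same place, it can help to tally them instead of reading the dump by eye. The following is only a minimal sketch: it assumes the dump is saved as plain text in the layout that `jstack` prints (a quoted thread name, a `java.lang.Thread.State:` line, then `at ...` frames). The function name `blocked_top_frames` is made up for illustration and is not part of Solr.

```python
import re
from collections import Counter

# Assumed jstack-style layout (not Solr-specific):
#   "thread-name" #12 prio=5 ...
#      java.lang.Thread.State: BLOCKED (on object monitor)
#       at some.package.Class.method(File.java:10)
THREAD_HEADER = re.compile(r'^"[^"]+"')
STATE_LINE = re.compile(r'^\s*java\.lang\.Thread\.State:\s*(?P<state>\w+)')
FRAME_LINE = re.compile(r'^\s*at\s+(?P<frame>[^\s(]+)')


def blocked_top_frames(dump_text):
    """Count BLOCKED threads by their topmost stack frame.

    Hypothetical helper: returns a Counter mapping the fully qualified
    method name of the top frame to the number of BLOCKED threads there.
    """
    counts = Counter()
    want_frame = False  # True while looking for the top frame of a BLOCKED thread
    for line in dump_text.splitlines():
        if THREAD_HEADER.match(line):
            # A new thread starts; forget the previous thread's state.
            want_frame = False
            continue
        state = STATE_LINE.match(line)
        if state:
            want_frame = state.group("state") == "BLOCKED"
            continue
        frame = FRAME_LINE.match(line)
        if frame and want_frame:
            counts[frame.group("frame")] += 1
            want_frame = False  # only the topmost frame is counted
    return counts
```

Running this on a dump captured during a slow window (for example with `jstack <pid>`) and calling `.most_common()` on the result shows where the threads pile up. A single dominant frame suggests they are all contending for the same lock.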


