Hi On Mon, Feb 10, 2020 at 5:05 PM Vangelis Katsikaros <vkatsika...@gmail.com> wrote:
> Hi all > > We run Solr 8.2.0 > * with Amazon Corretto 11.0.5.10.1 SDK (java arguments shown in [1]), > * on Ubuntu 18.04 > * on AWS EC2 m5.2xlarge with 8 CPUs and 32GB of RAM > * with -Xmx16g [1]. > > We have migrated from Solr 3.5 and in big core (16GB) replicas we have > started to suffer degraded service. The replica’s ReplicationHandler is in > [8] and the master’s updateHandler in [9]. > > We notice every 5 mins (the value for solr.autoCommit.maxTime) the > following: > * Solr uses all 8 CPUs. Suddenly for ~30 sec, it uses only 1 CPU at 100% > and the rest of the CPUs are idle (mpstat [6]). In our previous setup with > Solr 3 we used up to 80% of all CPUs. > * During that time the solr queries suddenly take more than 1 second, up > to 30 sec (or more). The same queries otherwise need less than 1 sec to > complete. > * The disk does not seem to be a bottleneck (iostat [4]). > * Memory does not seem to be a bottleneck (vmstat [5]). > * CPU (apart from the single CPU issue) does not seem to be a bottleneck > (mpstat [6] & pidstat [3]). > * We are no java/GC experts but It does not seem to be GC related [7]. > > We have tried reducing the heap to 8 and 2GB with no positive effect. We > have tested different autoCommit.maxTime values. Reducing it to 60 seconds > makes things unbearable. 5 minutes is not significantly different than 10. > Do you have any pointers to proceed debugging the issue? > > Detailed example problem that repeats every solr.autoCommit.maxTime > minutes on the replicas: > * From 12:36 to 12:39:04 queries are fast to serve [2]. Solr consumes CPU > from all 8 CPUs (mpstat [6]). The metric solr.jvm.threads.blocked.count is > 0 [2]. > * From 12:39:04 to 12:39:25 queries are slow to respond [2]. Solr consumes > only 1 out of 8 CPUs, the other 7 CPUs are idle (mpstat [6]). The metric > solr.jvm.threads.blocked.count grows from 0 to a big 2 digit number [2]. > * After 12:39:25 and until the next poll of a commit things are normal. > > Regards > Vangelis > > [1] > https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-solr-info > [2] > https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-slow-queries-and-solr-jvm-threads-blocked-count > [3] > https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-pidstat > [4] > https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-iostat > [5] > https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-vmstat > [6] > https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-mpstat > [7] > https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-gc-logs > [8] > https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-replica-replicationhandler > [9] > https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-master-updatehandler > Some additional information. We noticed (through the admin's "Thread Dump" /solr/#/~threads) that whenever we see this behavior the all the threads that block show the same stacktrace [10] and block at org.apache.solr.search.function.FileFloatSource$Cache.get(FileFloatSource.java:198) org.apache.solr.search.function.FileFloatSource.getCachedFloats(FileFloatSource.java:152) org.apache.solr.search.function.FileFloatSource.getValues(FileFloatSource.java:95) org.apache.lucene.queries.function.valuesource.MultiFloatFunction.getValues(MultiFloatFunction.java:76) org.apache.lucene.queries.function.ValueSource$WrappedDoubleValuesSource.getValues(ValueSource.java:203) org.apache.lucene.queries.function.FunctionScoreQuery$MultiplicativeBoostValuesSource.getValues(FunctionScoreQuery.java:255) org.apache.lucene.queries.function.FunctionScoreQuery$FunctionScoreWeight.scorer(FunctionScoreQuery.java:218) ... The boostfiles (external_boostvalue) are ~30M large and the schema fields are configured in the schema [11] with: <field name="boostvalue" type="fileboost"/> Regards Vangelis [10] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-stacktrace [11] https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-schema-boostfile