I conducted another experiment today with local SSD drives, but this did
not seem to fix my problem.
I don't see any extensive I/O in this case:


Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              1.76        88.83         5.52    1256191      77996
xvdb             13.95       111.30     56663.93    1573961  801303364

xvdb is the device where SolrCloud is installed and where the data files are kept.

What I see:
- There are 17 "Lucene Merge Thread #..." threads running. Some of them are
blocked, some of them are RUNNING.
- The updateExecutor-N-thread-M threads are parked, and the number of docs
that I am able to submit is still low.
- I tried changing maxIndexingThreads, setting it to something high. This
seems to prolong the time during which the cluster accepts new indexing
requests and keeps CPU utilization a lot higher while the cluster is
merging indexes.

Could anyone please point me in the right direction (documentation or Java
classes) where I can read about how data is passed from the updateExecutor
thread pool to the merge threads? I assume there should be some internal
blocking queue or something similar; my current mental model is sketched below.
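
To make the question concrete, here is that mental model written down as a
plain-Lucene sketch. This is just my own illustration, not Solr's actual code
path; the index directory, buffer sizes and merge-thread counts are made-up
placeholders (the buffer settings only mirror the values from my solrconfig):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class MergeBackpressureSketch {
    public static void main(String[] args) throws Exception {
        // As far as I can tell there is no explicit queue between the update
        // threads and the merge threads: update threads flush in-memory
        // segments, the MergePolicy picks which segments to merge, and the
        // ConcurrentMergeScheduler runs those merges on the
        // "Lucene Merge Thread #N" threads visible in my thread dump.
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();

        // My understanding (please correct me): if more than maxMergeCount
        // merges are pending, the scheduler stalls the indexing threads until
        // merging catches up, which would look exactly like clients being
        // "blocked" even though no queue is visible.
        cms.setMaxMergesAndThreads(/* maxMergeCount */ 6, /* maxThreadCount */ 3);

        // The pauseNanos()/maybePause() frames from the earlier stack trace
        // come from the automatic IO throttle; plain Lucene can switch it off.
        cms.disableAutoIOThrottle();

        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer())
                .setRAMBufferSizeMB(2000)       // mirrors my ramBufferSizeMB
                .setMaxBufferedDocs(500_000)    // mirrors my maxBufferedDocs
                .setMergePolicy(new TieredMergePolicy())
                .setMergeScheduler(cms);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/sketch-index")), cfg)) {
            Document doc = new Document();
            doc.add(new TextField("body", "hello merge threads", Field.Store.NO));
            // addDocument() only buffers in RAM; flushed segments are handed
            // to the merge machinery behind the scenes.
            writer.addDocument(doc);
            writer.commit();
        }
    }
}

Please correct me if the stalling part is wrong; that is the piece I am
trying to confirm.
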
I still cannot wrap my head around how Solr blocks incoming connections.
Non-merged indexes are not kept in memory, so I don't clearly understand
why Solr cannot keep writing index files to disk while other threads are
merging indexes (since merging is a continuous process anyway).
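
If the sketch above is roughly right, I assume the back-pressure knobs would
be the mergeScheduler settings in solrconfig.xml, something like the fragment
below (untested on my side, the numbers are placeholders I picked for
illustration):

<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">18</int>
    <int name="maxThreadCount">6</int>
  </mergeScheduler>
</indexConfig>

Would raising maxMergeCount be the usual way to keep indexing threads from
stalling, or does that just hide the underlying I/O problem?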

Does anyone use the SPM monitoring tool for this type of problem? Is it of
any use at all?


Thank you in advance.

[image: image.png]


Regards,
Denis


On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev <demic...@gmail.com> wrote:

> Mikhail,
>
> Sure, I will keep everyone posted. Moving to non-HVM instance may take
> some time, so hopefully I will be able to share my observations in the next
> couple of days or so.
> Thanks again for all the help.
>
> Regards,
> Denis
>
>
> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev <m...@apache.org> wrote:
>
>> Denis, please let me know what it ends up with. I'm really curious
>> regarding this case and AWS instance flavours. fwiw since 7.4 we'll have
>> ioThrottle=false option.
>>
>> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev <demic...@gmail.com>
>> wrote:
>>
>>> Mikhail, Erick,
>>>
>>> Thank you.
>>>
>>> What just occurred to me - we don't use local SSD but instead we're
>>> using EBS volumes.
>>> This was a wrong instance type that I looked at.
>>> Will try to set up a cluster with SSD nodes and retest.
>>>
>>> Regards,
>>> Denis
>>>
>>>
>>> On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev <m...@apache.org>
>>> wrote:
>>>
>>>> I'm not sure it's the right context, but here is one guy showing a really
>>>> low throttle boundary
>>>>
>>>> https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348
>>>>
>>>>
>>>> On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev <m...@apache.org>
>>>> wrote:
>>>>
>>>>> Threads are hanging on merge IO throttling
>>>>>
>>>>>         at 
>>>>> org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
>>>>>         at 
>>>>> org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
>>>>>         at 
>>>>> org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
>>>>>         at 
>>>>> org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
>>>>>
>>>>> It seems odd. Please confirm that you don't commit on every update
>>>>> request.
>>>>> The only way to monitor IO throttling is to enable infoStream and read
>>>>> a lot of logs.
>>>>>
>>>>>
>>>>> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev <demic...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Erick,
>>>>>>
>>>>>> Thank you for your quick response.
>>>>>>
>>>>>> I/O bottleneck: Please see another screenshot attached, as you can
>>>>>> see disk r/w operations are pretty low or not significant.
>>>>>> iostat==========
>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>> xvda              0.00     0.00    0.00    0.00     0.00     0.00
>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>
>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>           12.52    0.00    0.00    0.00    0.00   87.48
>>>>>>
>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>> xvda              0.00     0.00    0.00    0.00     0.00     0.00
>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>>
>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>           12.51    0.00    0.00    0.00    0.00   87.49
>>>>>>
>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>> xvda              0.00     0.00    0.00    0.00     0.00     0.00
>>>>>> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>>> ==========================
>>>>>>
>>>>>> Merging threads: I don't see any modifications of a merging policy
>>>>>> comparing to the default solrconfig.
>>>>>> Index config:
>>>>>> <ramBufferSizeMB>2000</ramBufferSizeMB><maxBufferedDocs>500000</maxBufferedDocs>
>>>>>> Update handler: <updateHandler class="solr.DirectUpdateHandler2">
>>>>>> Could you please help me understand how I can validate this theory?
>>>>>> Another note here. Even if I remove the stress from the cluster I
>>>>>> still see that merging thread is consuming CPU for some time. It may take
>>>>>> hours and if I try to return the stress back nothing changes.
>>>>>> If this is an overloaded merging process, it should take some time to
>>>>>> reduce the queue length and then start accepting new indexing
>>>>>> requests.
>>>>>> Maybe I am wrong, but I need some help to understand how to check it.
>>>>>>
>>>>>> AWS - Sorry, I don't have any physical hardware to replicate this
>>>>>> test locally
>>>>>>
>>>>>> GC - I monitored GC closely. If you take a look at CPU utilization
>>>>>> screenshot you will see a blue graph that is GC consumption. In addition 
>>>>>> to
>>>>>> that I am using Visual GC plugin from VisualVM to understand how GC
>>>>>> performs under the stress and don't see any anomalies.
>>>>>> There are several GC pauses from time to time but those are not
>>>>>> significant. Heap utilization graph tells me that GC is not struggling a
>>>>>> lot.
>>>>>>
>>>>>> Thank you again for your comments, hope the information above will
>>>>>> help you understand the problem.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson <
>>>>>> erickerick...@gmail.com> wrote:
>>>>>>
>>>>>>> Have you changed any of the merge policy parameters? I doubt it but
>>>>>>> just asking.
>>>>>>>
>>>>>>> My guess: your I/O is your bottleneck. There are a limited number of
>>>>>>> threads (tunable) that are used for background merging. When they're
>>>>>>> all busy, incoming updates are queued up. This squares with your
>>>>>>> statement that queries are fine and CPU activity is moderate.
>>>>>>>
>>>>>>> A quick test there would be to try this on a non-AWS setup if you
>>>>>>> have
>>>>>>> some hardware you can repurpose.
>>>>>>>
>>>>>>> an 80G heap is a red flag. Most of the time that's too large by far.
>>>>>>> So one thing I'd do is hook up some GC monitoring, you may be
>>>>>>> spending
>>>>>>> a horrible amount of time in GC cycles.
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > All,
>>>>>>> >
>>>>>>> > I would like to request some assistance with a situation described
>>>>>>> below. My
>>>>>>> > SolrCloud cluster accepts the update requests at a very low pace
>>>>>>> making it
>>>>>>> > impossible to index new documents.
>>>>>>> >
>>>>>>> > Cluster Setup:
>>>>>>> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
>>>>>>> > Cluster - SolrCloud 7.2.1, 10 instances r4.4xlarge, 120GB physical
>>>>>>> memory,
>>>>>>> > 80GB Java Heap space, AWS
>>>>>>> > Java - openjdk version "1.8.0_161" OpenJDK Runtime Environment
>>>>>>> (build
>>>>>>> > 1.8.0_161-b14) OpenJDK 64-Bit Server VM (build 25.161-b14, mixed
>>>>>>> mode)
>>>>>>> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
>>>>>>> >
>>>>>>> > Symptoms:
>>>>>>> > 1. 4 instances running 4 threads each are using SolrJ client to
>>>>>>> submit
>>>>>>> > documents to SolrCloud for indexing, do not perform any manual
>>>>>>> commits. Each
>>>>>>> > document  batch is 10 documents big, containing ~200 text fields
>>>>>>> per
>>>>>>> > document.
>>>>>>> > 2. After some time (~20-30 minutes, by that time I see only
>>>>>>> ~50-60K of
>>>>>>> > documents in a collection, node restarts do not help) I notice
>>>>>>> that clients
>>>>>>> > cannot submit new documents to the cluster for indexing anymore,
>>>>>>> each
>>>>>>> > operation takes enormous amount of time
>>>>>>> > 3. Cluster is not loaded at all, CPU consumption is moderate (I am
>>>>>>> seeing
>>>>>>> > that merging is performed all the time though), memory consumption
>>>>>>> is
>>>>>>> > adequate, but still updates are not accepted from external clients
>>>>>>> > 4. Search requests are handled fine
>>>>>>> > 5. I don't see any significant activity in SolrCloud logs
>>>>>>> anywhere, just
>>>>>>> > regular replication attempts only. No errors.
>>>>>>> >
>>>>>>> >
>>>>>>> > Additional information
>>>>>>> > 1. Please see Thread Dump attached.
>>>>>>> > 2. Please see SolrAdmin info with physical memory and file
>>>>>>> descriptor
>>>>>>> > utilization
>>>>>>> > 3. Please see VisualVM screenshots with CPU and memory utilization
>>>>>>> and CPU
>>>>>>> > profiling data. Physical memory utilization is about 60-70 percent
>>>>>>> all the
>>>>>>> > time.
>>>>>>> > 4. Schema file contains ~10 permanent fields 5 of which are mapped
>>>>>>> and
>>>>>>> > mandatory and persisted, the rest of the fields are optional and
>>>>>>> dynamic
>>>>>>> > 5. Solr config configures autoCommit to be set to 2 minutes and
>>>>>>> openSearcher
>>>>>>> > set to false
>>>>>>> > 6. Caches are set up with autoWarmCount = 0
>>>>>>> > 7. GC was fine tuned and I don't see any significant CPU
>>>>>>> utilization by GC
>>>>>>> > or any lengthy pauses. Majority of the garbage is collected in
>>>>>>> young gen
>>>>>>> > space.
>>>>>>> >
>>>>>>> > My primary question: I see that the cluster is alive and performs
>>>>>>> some
>>>>>>> > merging and commits but does not accept new documents for
>>>>>>> indexing. What is
>>>>>>> > causing this slowdown and why it does not accept new submissions?
>>>>>>> >
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> > Denis
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sincerely yours
>>>>> Mikhail Khludnev
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sincerely yours
>>>> Mikhail Khludnev
>>>>
>>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
