Denis, can you enable infoStream (https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#IndexConfiginSolrConfig-OtherIndexingSettings) and examine the logs for throttling? And what happens if you try without auto-commit?
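For reference, a minimal sketch of turning it on in solrconfig.xml, per the page above (the flag enables Lucene's InfoStream diagnostics, which go to the Solr log):

    <indexConfig>
      <!-- emit detailed Lucene indexing/merging diagnostics to the log -->
      <infoStream>true</infoStream>
    </indexConfig>

The output is verbose; the merge scheduler and rate limiter messages are the ones to look for here.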
On Tue, Apr 24, 2018 at 12:37 AM, Denis Demichev <demic...@gmail.com> wrote:

> I conducted another experiment today with local SSD drives, but this did
> not seem to fix my problem. I don't see any extensive I/O in this case:
>
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read     kB_wrtn
> xvda              1.76        88.83         5.52    1256191       77996
> xvdb             13.95       111.30     56663.93    1573961   801303364
>
> xvdb is the device where SolrCloud is installed and the data files are
> kept.
>
> What I see:
> - There are 17 "Lucene Merge Thread #..." threads running. Some of them
>   are blocked, some of them are RUNNING.
> - The updateExecutor-N-thread-M threads are parked, and the number of
>   docs I am able to submit is still low.
> - I tried setting maxIndexingThreads to something high. This seems to
>   prolong the time the cluster accepts new indexing requests and keeps
>   CPU utilization a lot higher while the cluster is merging indexes.
>
> Could anyone please point me in the right direction (documentation or
> Java classes) to read about how data is passed from the updateExecutor
> thread pool to the merge threads? I assume there is some internal
> blocking queue or something similar.
> I still cannot wrap my head around how Solr blocks incoming connections.
> Non-merged segments are not kept in memory, so I don't clearly understand
> why Solr cannot keep writing index files to disk while other threads are
> merging (since merging is a continuous process anyway).
>
> Does anyone use the SPM monitoring tool for this type of problem? Is it
> of any use at all?
>
> Thank you in advance.
>
> [image: image.png]
>
> Regards,
> Denis
>
> On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev <demic...@gmail.com> wrote:
>
>> Mikhail,
>>
>> Sure, I will keep everyone posted. Moving to a non-HVM instance may take
>> some time, so hopefully I will be able to share my observations in the
>> next couple of days or so.
>> Thanks again for all the help.
>>
>> Regards,
>> Denis
>>
>> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev <m...@apache.org> wrote:
>>
>>> Denis, please let me know what it ends up with. I'm really curious
>>> about this case and the AWS instance flavours. FWIW, since 7.4 we'll
>>> have an ioThrottle=false option.
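>>>
>>> A sketch of how that could look in solrconfig.xml once 7.4 lands, based
>>> on SOLR-11200 (untested; the knob does not exist in your 7.2.1):
>>>
>>>     <indexConfig>
>>>       <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>>>         <!-- 7.4+ only (SOLR-11200): disable adaptive merge IO throttling -->
>>>         <bool name="ioThrottle">false</bool>
>>>       </mergeScheduler>
>>>     </indexConfig>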
>>>
>>> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev <demic...@gmail.com> wrote:
>>>
>>>> Mikhail, Erick,
>>>>
>>>> Thank you.
>>>>
>>>> What just occurred to me: we don't use local SSDs, we're using EBS
>>>> volumes instead. I was looking at the wrong instance type.
>>>> I will try to set up a cluster with SSD nodes and retest.
>>>>
>>>> Regards,
>>>> Denis
>>>>
>>>> On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev <m...@apache.org> wrote:
>>>>
>>>>> I'm not sure it's the right context, but here is a case where someone
>>>>> reports a really low throttle boundary:
>>>>> https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348
>>>>>
>>>>> On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev <m...@apache.org> wrote:
>>>>>
>>>>>> Threads are hanging on merge IO throttling:
>>>>>>
>>>>>>     at org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
>>>>>>     at org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
>>>>>>     at org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
>>>>>>     at org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
>>>>>>
>>>>>> It seems odd. Please confirm that you don't commit on every update
>>>>>> request.
>>>>>> The only way to monitor IO throttling is to enable infoStream and
>>>>>> read a lot of logs.
>>>>>>
>>>>>> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev <demic...@gmail.com> wrote:
>>>>>>
>>>>>>> Erick,
>>>>>>>
>>>>>>> Thank you for your quick response.
>>>>>>>
>>>>>>> I/O bottleneck: please see another screenshot attached; as you can
>>>>>>> see, disk r/w operations are pretty low or not significant.
>>>>>>>
>>>>>>> iostat ==========
>>>>>>> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> xvda       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> avg-cpu:  %user  %nice %system %iowait %steal  %idle
>>>>>>>           12.52   0.00    0.00    0.00   0.00  87.48
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> xvda       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>>
>>>>>>> avg-cpu:  %user  %nice %system %iowait %steal  %idle
>>>>>>>           12.51   0.00    0.00    0.00   0.00  87.49
>>>>>>>
>>>>>>> Device:  rrqm/s  wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
>>>>>>> xvda       0.00    0.00  0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
>>>>>>> ==========================
>>>>>>>
>>>>>>> Merging threads: I don't see any modifications of the merge policy
>>>>>>> compared to the default solrconfig.
>>>>>>> Index config: <ramBufferSizeMB>2000</ramBufferSizeMB>
>>>>>>> <maxBufferedDocs>500000</maxBufferedDocs>
>>>>>>> Update handler: <updateHandler class="solr.DirectUpdateHandler2">
>>>>>>> Could you please help me understand how I can validate this theory?
>>>>>>>
>>>>>>> Another note here: even if I remove the stress from the cluster, I
>>>>>>> still see a merging thread consuming CPU for some time. It may take
>>>>>>> hours, and if I try to return the stress, nothing changes.
>>>>>>> If this were an overloaded merging process, it should take some time
>>>>>>> to work through the queue and then start accepting new indexing
>>>>>>> requests again.
>>>>>>> Maybe I am wrong, but I need some help to understand how to check it.
>>>>>>>
>>>>>>> AWS: sorry, I don't have any physical hardware to replicate this
>>>>>>> test locally.
>>>>>>>
>>>>>>> GC: I monitored GC closely. If you look at the CPU utilization
>>>>>>> screenshot you will see a blue graph, which is GC consumption. In
>>>>>>> addition, I am using the Visual GC plugin from VisualVM to understand
>>>>>>> how GC performs under stress, and I don't see any anomalies.
>>>>>>> There are several GC pauses from time to time, but those are not
>>>>>>> significant. The heap utilization graph tells me that GC is not
>>>>>>> struggling.
>>>>>>>
>>>>>>> Thank you again for your comments; hope the information above helps
>>>>>>> you understand the problem.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Denis
>>>>>>>
>>>>>>> On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Have you changed any of the merge policy parameters? I doubt it,
>>>>>>>> but just asking.
>>>>>>>>
>>>>>>>> My guess: your I/O is your bottleneck. There are a limited number
>>>>>>>> of threads (tunable) that are used for background merging. When
>>>>>>>> they're all busy, incoming updates are queued up. This squares with
>>>>>>>> your statement that queries are fine and CPU activity is moderate.
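>>>>>>>>
>>>>>>>> If you want to experiment with that, the scheduler is configured in
>>>>>>>> the <indexConfig> section of solrconfig.xml, roughly like this (the
>>>>>>>> numbers are illustrative starting points, not recommendations):
>>>>>>>>
>>>>>>>>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>>>>>>>>       <!-- merges allowed to queue up before incoming writer threads stall -->
>>>>>>>>       <int name="maxMergeCount">6</int>
>>>>>>>>       <!-- merge threads actually running at once; must be <= maxMergeCount -->
>>>>>>>>       <int name="maxThreadCount">4</int>
>>>>>>>>     </mergeScheduler>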
>>>>>>>>
>>>>>>>> A quick test there would be to try this on a non-AWS setup if you
>>>>>>>> have some hardware you can repurpose.
>>>>>>>>
>>>>>>>> An 80G heap is a red flag. Most of the time that's too large by
>>>>>>>> far. So one thing I'd do is hook up some GC monitoring; you may be
>>>>>>>> spending a horrible amount of time in GC cycles.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Erick
>>>>>>>>
>>>>>>>> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev <demic...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > All,
>>>>>>>> >
>>>>>>>> > I would like to request some assistance with the situation
>>>>>>>> > described below. My SolrCloud cluster accepts update requests at
>>>>>>>> > such a low pace that it is impossible to index new documents.
>>>>>>>> >
>>>>>>>> > Cluster setup:
>>>>>>>> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
>>>>>>>> > Cluster - SolrCloud 7.2.1, 10 r4.4xlarge instances, 120GB physical
>>>>>>>> > memory, 80GB Java heap space, AWS
>>>>>>>> > Java - openjdk version "1.8.0_161", OpenJDK Runtime Environment
>>>>>>>> > (build 1.8.0_161-b14), OpenJDK 64-Bit Server VM (build 25.161-b14,
>>>>>>>> > mixed mode)
>>>>>>>> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
>>>>>>>> >
>>>>>>>> > Symptoms:
>>>>>>>> > 1. 4 instances running 4 threads each use the SolrJ client to
>>>>>>>> > submit documents to SolrCloud for indexing and do not perform any
>>>>>>>> > manual commits. Each batch is 10 documents, with ~200 text fields
>>>>>>>> > per document.
>>>>>>>> > 2. After some time (~20-30 minutes; by then I see only ~50-60K
>>>>>>>> > documents in the collection, and node restarts do not help) the
>>>>>>>> > clients cannot submit new documents to the cluster for indexing
>>>>>>>> > anymore; each operation takes an enormous amount of time.
>>>>>>>> > 3. The cluster is not loaded at all: CPU consumption is moderate
>>>>>>>> > (although I see that merging is performed all the time) and memory
>>>>>>>> > consumption is adequate, but updates are still not accepted from
>>>>>>>> > external clients.
>>>>>>>> > 4. Search requests are handled fine.
>>>>>>>> > 5. I don't see any significant activity in the SolrCloud logs
>>>>>>>> > anywhere, just regular replication attempts. No errors.
>>>>>>>> >
>>>>>>>> > Additional information:
>>>>>>>> > 1. Please see the thread dump attached.
>>>>>>>> > 2. Please see the SolrAdmin info with physical memory and file
>>>>>>>> > descriptor utilization.
>>>>>>>> > 3. Please see the VisualVM screenshots with CPU and memory
>>>>>>>> > utilization and CPU profiling data. Physical memory utilization is
>>>>>>>> > about 60-70 percent all the time.
>>>>>>>> > 4. The schema file contains ~10 permanent fields, 5 of which are
>>>>>>>> > mapped, mandatory, and persisted; the rest of the fields are
>>>>>>>> > optional and dynamic.
>>>>>>>> > 5. The Solr config sets autoCommit to 2 minutes with openSearcher
>>>>>>>> > set to false (see the sketch after this list).
>>>>>>>> > 6. Caches are set up with autoWarmCount = 0.
>>>>>>>> > 7. GC was fine-tuned and I don't see any significant CPU
>>>>>>>> > utilization by GC or any lengthy pauses. The majority of the
>>>>>>>> > garbage is collected in the young-gen space.
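>>>>>>>> >
>>>>>>>> > For item 5, a sketch of the relevant solrconfig.xml block as just
>>>>>>>> > described (values mirror the description above):
>>>>>>>> >
>>>>>>>> >     <autoCommit>
>>>>>>>> >       <!-- hard commit every 2 minutes (120000 ms) -->
>>>>>>>> >       <maxTime>120000</maxTime>
>>>>>>>> >       <!-- don't open a new searcher on these commits -->
>>>>>>>> >       <openSearcher>false</openSearcher>
>>>>>>>> >     </autoCommit>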
>>>>>>>> >
>>>>>>>> > My primary question: I see that the cluster is alive and performs
>>>>>>>> > some merging and commits, but it does not accept new documents for
>>>>>>>> > indexing. What is causing this slowdown, and why does it not
>>>>>>>> > accept new submissions?
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > Denis
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sincerely yours
>>>>>> Mikhail Khludnev
>>>>>
>>>>> --
>>>>> Sincerely yours
>>>>> Mikhail Khludnev
>>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>
--
Sincerely yours
Mikhail Khludnev