Re: SolrCloud cluster does not accept new documents for indexing

2018-04-25 Thread Denis Demichev
Shawn, Mikhail, Chris,

Thank you all for your feedback.
Unfortunately I cannot try your recommendations right away - this week is
busy.
Will post my results here next week.

Regards,
Denis


On Tue, Apr 24, 2018 at 11:33 AM Shawn Heisey  wrote:

> On 4/24/2018 6:30 AM, Chris Ulicny wrote:
> > I haven't worked with AWS, but recently we tried to move some of our solr
> > instances to a cloud in Google's Cloud offering, and it did not go well.
> > All of our problems ended up stemming from the fact that the I/O is
> > throttled. Any complicated enough query would require too many disk reads
> > to return the results in a reasonable time when being throttled. SSDs
> were
> > better but not a practical cost and not as performant as our own bare
> metal.
>
> If there's enough memory installed beyond what is required for the Solr
> heap, then Solr will rarely need to actually read the disk to satisfy a
> query.  That is the secret to stellar performance.  If switching to
> faster disks made a big difference in query performance, adding memory
> would yield an even greater improvement.
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
>
> > When we were doing the initial indexing, the indexing processes would get
> > to a point where the updates were taking minutes to complete and the
> cause
> > was throttled write ops.
>
> Indexing speed is indeed affected by disk speed, and adding memory can't
> fix that particular problem.  Using a storage controller with a large
> amount of battery-backed cache memory can improve it.
>
> > -- set the max threads and max concurrent merges of the mergeScheduler to
> > be 1 (or very low). This prevented excessive IO during indexing.
>
> The max threads should be at 1 in the merge scheduler, but the max
> merges should actually be *increased*.  I use a value of 6 for that.
> With SSD disks, the max threads can be increased, but I wouldn't push it
> very high.
>
> Thanks,
> Shawn
>
>


Re: SolrCloud cluster does not accept new documents for indexing

2018-04-25 Thread Emir Arnautović
Hi Denis,
Merges work on segments and, depending on the merge strategy, are triggered 
separately, so there is no queue between the update executor and the merge threads.

Re SPM - I am using it on a daily basis for most of my consulting work, and if 
you have an SPM app you can invite me to it and I’ll take a quick look to see if 
there are any obvious bottlenecks.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 23 Apr 2018, at 23:37, Denis Demichev  wrote:
> 
> I conducted another experiment today with local SSD drives, but this did not 
> seem to fix my problem.
> Don't see any extensive I/O in this case:
> 
> Device:            tps    kB_read/s    kB_wrtn/s      kB_read      kB_wrtn
> xvda              1.76        88.83         5.52      1256191        77996
> xvdb             13.95       111.30     56663.93      1573961    801303364
> 
> xvdb - is the device where SolrCloud is installed and data files are kept.
> 
> What I see:
> - There are 17 "Lucene Merge Thread #..." running. Some of them are blocked, 
> some of them are RUNNING
> - updateExecutor-N-thread-M threads are in parked mode and number of docs 
> that I am able to submit is still low
> - Tried to change maxIndexingThreads, set it to something high. This seems to 
> prolong the time when cluster is accepting new indexing requests and keeps 
> CPU utilization a lot higher while the cluster is merging indexes
> 
> Could anyone please point me to the right direction (documentation or Java 
> classes) where I can read about how data is passed from updateExecutor thread 
> pool to Merge Threads? I assume there should be some internal blocking queue 
> or something similar.
> Still cannot wrap my head around how Solr blocks incoming connections. Non 
> merged indexes are not kept in memory so I don't clearly understand why Solr 
> cannot keep writing index file to HDD while other threads are merging indexes 
> (since this is a continuous process anyway).
> 
> Does anyone use SPM monitoring tool for that type of problems? Is it of any 
> use at all?
> 
> 
> Thank you in advance.
> 
> 
> 
> 
> Regards,
> Denis
> 
> 
> On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev wrote:
> Mikhail,
> 
> Sure, I will keep everyone posted. Moving to non-HVM instance may take some 
> time, so hopefully I will be able to share my observations in the next couple 
> of days or so.
> Thanks again for all the help.
> 
> Regards,
> Denis
> 
> 
> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev wrote:
> Denis, please let me know what it ends up with. I'm really curious regarding 
> this case and AWS instance flavours. fwiw since 7.4 we'll have 
> ioThrottle=false option. 
> 
> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev wrote:
> Mikhail, Erick,
> 
> Thank you.
> 
> What just occurred to me - we don't use local SSD but instead we're using EBS 
> volumes.
> This was a wrong instance type that I looked at.
> Will try to set up a cluster with SSD nodes and retest.
> 
> Regards,
> Denis
> 
> 
> On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev wrote:
> I'm not sure it's the right context, but here is one guy showing a really low 
> throttle boundary 
> https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348
>  
> 
> 
> 
> On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev wrote:
> Threads are hanging on merge io throttling 
> at 
> org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
> at 
> org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
> at 
> org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
> at 
> org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
> It seems odd. Please confirm that you don't commit on every update request. 
> The only way to monitor io throttling is to enable infostream and read a lot 
> of logs.
>
> 
> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev wrote:
> Erick,
> 
> Thank you for your quick response.
> 
> I/O bottleneck: Please see another screenshot attached, as you can see disk 
> r/w operations are pretty low or not significant.
> iostat ===========
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00

Re: SolrCloud cluster does not accept new documents for indexing

2018-04-24 Thread Shawn Heisey

On 4/24/2018 6:30 AM, Chris Ulicny wrote:

I haven't worked with AWS, but recently we tried to move some of our solr
instances to a cloud in Google's Cloud offering, and it did not go well.
All of our problems ended up stemming from the fact that the I/O is
throttled. Any complicated enough query would require too many disk reads
to return the results in a reasonable time when being throttled. SSDs were
better but not a practical cost and not as performant as our own bare metal.


If there's enough memory installed beyond what is required for the Solr 
heap, then Solr will rarely need to actually read the disk to satisfy a 
query.  That is the secret to stellar performance.  If switching to 
faster disks made a big difference in query performance, adding memory 
would yield an even greater improvement.


https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
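
Concretely: on the r4.4xlarge nodes described further down in this thread 
(120GB of physical memory with an 80GB heap), only roughly 40GB is left for the 
OS page cache, so only a fraction of the index can be cached; a much smaller 
heap would leave most of the 120GB available for caching index data.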


When we were doing the initial indexing, the indexing processes would get
to a point where the updates were taking minutes to complete and the cause
was throttled write ops.


Indexing speed is indeed affected by disk speed, and adding memory can't 
fix that particular problem.  Using a storage controller with a large 
amount of battery-backed cache memory can improve it.



-- set the max threads and max concurrent merges of the mergeScheduler to
be 1 (or very low). This prevented excessive IO during indexing.


The max threads should be at 1 in the merge scheduler, but the max 
merges should actually be *increased*.  I use a value of 6 for that.  
With SSD disks, the max threads can be increased, but I wouldn't push it 
very high.
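
For reference, that suggestion would look roughly like the following in 
solrconfig.xml (a sketch using Shawn's numbers, not Denis's actual 
configuration):

    <indexConfig>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxThreadCount">1</int> <!-- one merge thread on spinning disks -->
        <int name="maxMergeCount">6</int>  <!-- allow several merges to queue up -->
      </mergeScheduler>
    </indexConfig>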


Thanks,
Shawn



Re: SolrCloud cluster does not accept new documents for indexing

2018-04-24 Thread Mikhail Khludnev
Denis,
Can you enable infoStream
https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#IndexConfiginSolrConfig-OtherIndexingSettings
and examine logs about throttling?
And what if you try without auto-commit?
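
For reference, infoStream is switched on in the indexConfig section of 
solrconfig.xml; a minimal sketch:

    <indexConfig>
      <!-- log low-level IndexWriter activity: flushes, merges, throttling pauses -->
      <infoStream>true</infoStream>
    </indexConfig>

The output is verbose, so it is usually enabled only while diagnosing a problem 
like this one.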

On Tue, Apr 24, 2018 at 12:37 AM, Denis Demichev  wrote:

> I conducted another experiment today with local SSD drives, but this did
> not seem to fix my problem.
> Don't see any extensive I/O in this case:
>
>
> Device:            tps    kB_read/s    kB_wrtn/s      kB_read      kB_wrtn
> xvda              1.76        88.83         5.52      1256191        77996
> xvdb             13.95       111.30     56663.93      1573961    801303364
>
> xvdb - is the device where SolrCloud is installed and data files are kept.
>
> What I see:
> - There are 17 "Lucene Merge Thread #..." running. Some of them are
> blocked, some of them are RUNNING
> - updateExecutor-N-thread-M threads are in parked mode and number of docs
> that I am able to submit is still low
> - Tried to change maxIndexingThreads, set it to something high. This seems
> to prolong the time when cluster is accepting new indexing requests and
> keeps CPU utilization a lot higher while the cluster is merging indexes
>
> Could anyone please point me to the right direction (documentation or Java
> classes) where I can read about how data is passed from updateExecutor
> thread pool to Merge Threads? I assume there should be some internal
> blocking queue or something similar.
> Still cannot wrap my head around how Solr blocks incoming connections. Non
> merged indexes are not kept in memory so I don't clearly understand why
> Solr cannot keep writing index file to HDD while other threads are merging
> indexes (since this is a continuous process anyway).
>
> Does anyone use SPM monitoring tool for that type of problems? Is it of
> any use at all?
>
>
> Thank you in advance.
>
> [image: image.png]
>
>
> Regards,
> Denis
>
>
> On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev  wrote:
>
>> Mikhail,
>>
>> Sure, I will keep everyone posted. Moving to non-HVM instance may take
>> some time, so hopefully I will be able to share my observations in the next
>> couple of days or so.
>> Thanks again for all the help.
>>
>> Regards,
>> Denis
>>
>>
>> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev  wrote:
>>
>>> Denis, please let me know what it ends up with. I'm really curious
>>> regarding this case and AWS instance flavours. fwiw since 7.4 we'll have
>>> ioThrottle=false option.
>>>
>>> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev 
>>> wrote:
>>>
 Mikhail, Erick,

 Thank you.

 What just occurred to me - we don't use local SSD but instead we're
 using EBS volumes.
 This was a wrong instance type that I looked at.
 Will try to set up a cluster with SSD nodes and retest.

 Regards,
 Denis


 On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev 
 wrote:

> I'm not sure it's the right context, but here is one guy showing a really
> low throttle boundary
> https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348
>
>
> On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev 
> wrote:
>
>> Threads are hanging on merge io throttling
>>
>> at 
>> org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
>> at 
>> org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
>> at 
>> org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
>> at 
>> org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
>>
>> It seems odd. Please confirm that you don't commit on every update
>> request.
>> The only way to monitor io throttling is to enable infostream and
>> read a lot of logs.
>>
>>
>> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev 
>> wrote:
>>
>>> Erick,
>>>
>>> Thank you for your quick response.
>>>
>>> I/O bottleneck: Please see another screenshot attached, as you can
>>> see disk r/w operations are pretty low or not significant.
>>> iostat ===========
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>           12.52    0.00    0.00    0.00    0.00   87.48
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util

Re: SolrCloud cluster does not accept new documents for indexing

2018-04-24 Thread Chris Ulicny
I haven't worked with AWS, but recently we tried to move some of our Solr
instances to Google's Cloud offering, and it did not go well.
All of our problems ended up stemming from the fact that the I/O is
throttled. Any sufficiently complicated query would require too many disk reads
to return the results in a reasonable time when being throttled. SSDs were
better, but not practical cost-wise and not as performant as our own bare metal.

I'm not sure if that is what is happening in your case since it seemed like
your CPU time was mostly idle instead of I/O waits, but your case sounds a
lot like ours when we started indexing in the cloud instances. There might
be an equivalent metric for AWS, but Google had the number of throttled
reads and writes available (albeit through StackDriver) that we could track.

When we were doing the initial indexing, the indexing processes would get
to a point where the updates were taking minutes to complete and the cause
was throttled write ops.

A few things we did to get everything indexing at a reasonable rate for the
initial setup (sketched in solrconfig.xml form after this list):
-- autoCommit set to something very very low, like 10-15 seconds, and
openSearcher set to false
-- autoSoftCommit set to 1 hour or more (our indexing took days) to avoid
unnecessary read operations during indexing.
-- left the RAM buffer/buffered doc settings and maxIndexingThreads to the
defaults
-- set the max threads and max concurrent merges of the mergeScheduler to
be 1 (or very low). This prevented excessive IO during indexing.
-- Only keep one copy of each shard to avoid duplicate writes/merges on the
follower replicas. Add the redundant copies once after the bulk indexing.
-- There was some setting with respect to the storage objects to make them
faster at the expense of more CPU used (not waiting). It helped with indexing,
but didn't make a difference in the long run.
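
In solrconfig.xml terms, those settings would look roughly like the sketch
below. The values are illustrative (Chris gives ranges rather than exact
numbers) and assume the stock updateHandler and mergeScheduler elements:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>15000</maxTime>            <!-- hard commit every ~15 seconds -->
        <openSearcher>false</openSearcher>  <!-- don't open a searcher on hard commits -->
      </autoCommit>
      <autoSoftCommit>
        <maxTime>3600000</maxTime>          <!-- soft commit roughly hourly -->
      </autoSoftCommit>
    </updateHandler>

    <indexConfig>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxThreadCount">1</int>  <!-- single merge thread -->
        <int name="maxMergeCount">1</int>   <!-- very low cap on concurrent merges -->
      </mergeScheduler>
    </indexConfig>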

With regard to SPM: I haven't used it to troubleshoot this type of problem
before, but we use it for all of our Solr monitoring. The out-of-the-box
settings work very well for us, so I'm not sure how much metric
customization it allows beyond the ones that are initially set up.

Also, most of your attachments got filtered out by the mailing list,
particularly the images.

Best,
Chris

On Mon, Apr 23, 2018 at 5:38 PM Denis Demichev  wrote:

> I conducted another experiment today with local SSD drives, but this did
> not seem to fix my problem.
> Don't see any extensive I/O in this case:
>
>
> Device:            tps    kB_read/s    kB_wrtn/s      kB_read      kB_wrtn
> xvda              1.76        88.83         5.52      1256191        77996
> xvdb             13.95       111.30     56663.93      1573961    801303364
>
> xvdb - is the device where SolrCloud is installed and data files are kept.
>
> What I see:
> - There are 17 "Lucene Merge Thread #..." running. Some of them are
> blocked, some of them are RUNNING
> - updateExecutor-N-thread-M threads are in parked mode and number of docs
> that I am able to submit is still low
> - Tried to change maxIndexingThreads, set it to something high. This seems
> to prolong the time when cluster is accepting new indexing requests and
> keeps CPU utilization a lot higher while the cluster is merging indexes
>
> Could anyone please point me to the right direction (documentation or Java
> classes) where I can read about how data is passed from updateExecutor
> thread pool to Merge Threads? I assume there should be some internal
> blocking queue or something similar.
> Still cannot wrap my head around how Solr blocks incoming connections. Non
> merged indexes are not kept in memory so I don't clearly understand why
> Solr cannot keep writing index file to HDD while other threads are merging
> indexes (since this is a continuous process anyway).
>
> Does anyone use SPM monitoring tool for that type of problems? Is it of
> any use at all?
>
>
> Thank you in advance.
>
> [image: image.png]
>
>
> Regards,
> Denis
>
>
> On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev  wrote:
>
>> Mikhail,
>>
>> Sure, I will keep everyone posted. Moving to non-HVM instance may take
>> some time, so hopefully I will be able to share my observations in the next
>> couple of days or so.
>> Thanks again for all the help.
>>
>> Regards,
>> Denis
>>
>>
>> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev  wrote:
>>
>>> Denis, please let me know what it ends up with. I'm really curious
>>> regarding this case and AWS instance flavours. fwiw since 7.4 we'll have
>>> ioThrottle=false option.
>>>
>>> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev 
>>> wrote:
>>>
 Mikhail, Erick,

 Thank you.

 What just occurred to me - we don't use local SSD but instead we're
 using EBS volumes.
 This was a wrong instance type that I looked at.
 Will try to set up a cluster with SSD nodes and retest.

 Regards,
 Denis


 On Thu, Apr 19, 2018 at 2:56 PM Mikhail 

Re: SolrCloud cluster does not accept new documents for indexing

2018-04-23 Thread Denis Demichev
I conducted another experiment today with local SSD drives, but this did
not seem to fix my problem.
Don't see any extensive I/O in this case:


Device:            tps    kB_read/s    kB_wrtn/s      kB_read      kB_wrtn
xvda              1.76        88.83         5.52      1256191        77996
xvdb             13.95       111.30     56663.93      1573961    801303364

xvdb is the device where SolrCloud is installed and data files are kept.

What I see:
- There are 17 "Lucene Merge Thread #..." running. Some of them are
blocked, some of them are RUNNING
- updateExecutor-N-thread-M threads are parked, and the number of docs
that I am able to submit is still low
- Tried to change maxIndexingThreads and set it to something high. This seems
to prolong the time during which the cluster accepts new indexing requests and
keeps CPU utilization a lot higher while the cluster is merging indexes

Could anyone please point me in the right direction (documentation or Java
classes) where I can read about how data is passed from the updateExecutor
thread pool to the merge threads? I assume there should be some internal
blocking queue or something similar.
I still cannot wrap my head around how Solr blocks incoming connections. Non-merged
indexes are not kept in memory, so I don't clearly understand why
Solr cannot keep writing index files to disk while other threads are merging
indexes (since this is a continuous process anyway).

Does anyone use the SPM monitoring tool for this type of problem? Is it of any
use at all?


Thank you in advance.

[image: image.png]


Regards,
Denis


On Fri, Apr 20, 2018 at 1:28 PM Denis Demichev  wrote:

> Mikhail,
>
> Sure, I will keep everyone posted. Moving to non-HVM instance may take
> some time, so hopefully I will be able to share my observations in the next
> couple of days or so.
> Thanks again for all the help.
>
> Regards,
> Denis
>
>
> On Fri, Apr 20, 2018 at 6:02 AM Mikhail Khludnev  wrote:
>
>> Denis, please let me know what it ends up with. I'm really curious
>> regarding this case and AWS instance flavours. fwiw since 7.4 we'll have
>> ioThrottle=false option.
>>
>> On Thu, Apr 19, 2018 at 11:06 PM, Denis Demichev 
>> wrote:
>>
>>> Mikhail, Erick,
>>>
>>> Thank you.
>>>
>>> What just occurred to me - we don't use local SSD but instead we're
>>> using EBS volumes.
>>> This was a wrong instance type that I looked at.
>>> Will try to set up a cluster with SSD nodes and retest.
>>>
>>> Regards,
>>> Denis
>>>
>>>
>>> On Thu, Apr 19, 2018 at 2:56 PM Mikhail Khludnev 
>>> wrote:
>>>
 I'm not sure it's the right context, but here is one guy showing a really
 low throttle boundary

 https://issues.apache.org/jira/browse/SOLR-11200?focusedCommentId=16115348&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16115348


 On Thu, Apr 19, 2018 at 8:37 PM, Mikhail Khludnev 
 wrote:

> Threads are hanging on merge io throttling
>
> at 
> org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
> at 
> org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
> at 
> org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
> at 
> org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
>
> It seems odd. Please confirm that you don't commit on every update
> request.
> The only way to monitor io throttling is to enable infostream and read
> a lot of logs.
>
>
> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev 
> wrote:
>
>> Erick,
>>
>> Thank you for your quick response.
>>
>> I/O bottleneck: Please see another screenshot attached, as you can
>> see disk r/w operations are pretty low or not significant.
>> iostat ===========
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           12.52    0.00    0.00    0.00    0.00   87.48
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           12.51    0.00    0.00    0.00    0.00   87.49
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> xvda              0.00     0.00    0.00    0.00     0.00

Re: SolrCloud cluster does not accept new documents for indexing

2018-04-19 Thread Erick Erickson
When all indexing threads are occupied merging, incoming updates block
until at least one thread frees up IIUC.

The fact that you're not opening searchers doesn't matter as far as
merging is concerned, that happens regardless on hard commits.

Bumping your RAM buffer up to 2G is usually unnecessary; we rarely see
increased throughput above about 100M, but again I doubt that's the root
cause.

Unless AWS is throttling your I/O..

So this is puzzling to me, I don't have any more fresh ideas.

Best,
Erick

On Thu, Apr 19, 2018 at 10:48 AM, Denis Demichev  wrote:
> Mikhail,
>
> I see what you're saying. Thank you for the clarification.
> Yes, there's no single line in the client code that contains a commit
> statement.
> The only thing I do: solr.add(collectionName, dataToSend); where solr is a
> SolrClient.
> Autocommits are set up on the server side for 2 minutes and they do not open
> a searcher.
>
> What is interesting, if I look at CPU utilization by thread, I am seeing a
> pic attached.
> So at least 3 different threads are executing merges kinda concurrently.
>
> If I am not mistaken, based on the code I see, this stacktrace tells me that
> I have 7 merging threads running and they limit themselves to limit the CPU
> consumption...
> Still don't fully understand how this impacts the acceptor behavior and
> ability to handle "add" requests. Don't see this logical connection yet.
>
>
> Regards,
> Denis
>
>
> On Thu, Apr 19, 2018 at 1:37 PM Mikhail Khludnev  wrote:
>>
>> Threads are hanging on merge io throttling
>>
>> at
>> org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
>> at
>> org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
>> at
>> org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
>> at
>> org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
>>
>> It seems odd. Please confirm that you don't commit on every update
>> request.
>> The only way to monitor io throttling is to enable infostream and read a
>> lot of logs.
>>
>>
>> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev 
>> wrote:
>>
>> > Erick,
>> >
>> > Thank you for your quick response.
>> >
>> > I/O bottleneck: Please see another screenshot attached, as you can see
>> > disk r/w operations are pretty low or not significant.
>> > iostat ===========
>> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> > xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>> >
>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >           12.52    0.00    0.00    0.00    0.00   87.48
>> >
>> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> > xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>> >
>> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>> >           12.51    0.00    0.00    0.00    0.00   87.49
>> >
>> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> > xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>> > ===========
>> >
>> > Merging threads: I don't see any modifications of a merging policy
>> > comparing to the default solrconfig.
>> > Index config: <ramBufferSizeMB>2000</ramBufferSizeMB> <maxBufferedDocs>50</maxBufferedDocs>
>> > Update handler: 
>> > Could you please help me understand how can I validate this theory?
>> > Another note here. Even if I remove the stress from the cluster I still
>> > see that merging thread is consuming CPU for some time. It may take
>> > hours
>> > and if I try to return the stress back nothing changes.
>> > If this is overloaded merging process, it should take some time to
>> > reduce
>> > the queue length and it should start accepting new indexing requests.
>> > Maybe I am wrong, but I need some help to understand how to check it.
>> >
>> > AWS - Sorry, I don't have any physical hardware to replicate this test
>> > locally
>> >
>> > GC - I monitored GC closely. If you take a look at CPU utilization
>> > screenshot you will see a blue graph that is GC consumption. In addition
>> > to
>> > that I am using Visual GC plugin from VisualVM to understand how GC
>> > performs under the stress and don't see any anomalies.
>> > There are several GC pauses from time to time but those are not
>> > significant. Heap utilization graph tells me that GC is not struggling a
>> > lot.
>> >
>> > Thank you again for your comments, hope the information above will help
>> > you understand the problem.
>> >
>> >
>> > Regards,
>> > Denis
>> >
>> >
>> > On Thu, 

Re: SolrCloud cluster does not accept new documents for indexing

2018-04-19 Thread Denis Demichev
Mikhail,

I see what you're saying. Thank you for the clarification.
Yes, there's no single line in the client code that contains a commit
statement.
The only thing I do: solr.add(collectionName, dataToSend); where solr is a
SolrClient.
Autocommits are set up on the server side for 2 minutes and they do not
open a searcher.
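
For reference, the server-side setup described above corresponds to something
like the following in solrconfig.xml (the 2-minute interval comes from the
description; the element names are the stock autoCommit settings):

    <autoCommit>
      <maxTime>120000</maxTime>           <!-- hard commit every 2 minutes -->
      <openSearcher>false</openSearcher>  <!-- commits do not open a searcher -->
    </autoCommit>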

What is interesting: if I look at CPU utilization by thread, I see the
picture attached.
So at least 3 different threads are executing merges more or less concurrently.

If I am not mistaken, based on the code I see, this stack trace tells me
that I have 7 merging threads running and they pause themselves to limit
CPU consumption...
I still don't fully understand how this impacts the acceptor behavior and the
ability to handle "add" requests. I don't see the logical connection yet.


Regards,
Denis


On Thu, Apr 19, 2018 at 1:37 PM Mikhail Khludnev  wrote:

> Threads are hanging on merge io throttling
>
> at
> org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
> at
> org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
> at
> org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
> at
> org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)
>
> It seems odd. Please confirm that you don't commit on every update request.
> The only way to monitor io throttling is to enable infostream and read a
> lot of logs.
>
>
> On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev 
> wrote:
>
> > Erick,
> >
> > Thank you for your quick response.
> >
> > I/O bottleneck: Please see another screenshot attached, as you can see
> > disk r/w operations are pretty low or not significant.
> > iostat ===========
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           12.52    0.00    0.00    0.00    0.00   87.48
> >
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> >
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >           12.51    0.00    0.00    0.00    0.00   87.49
> >
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> > ===========
> >
> > Merging threads: I don't see any modifications of a merging policy
> > comparing to the default solrconfig.
> > Index config: <ramBufferSizeMB>2000</ramBufferSizeMB> <maxBufferedDocs>50</maxBufferedDocs>
> > Update handler: 
> > Could you please help me understand how can I validate this theory?
> > Another note here. Even if I remove the stress from the cluster I still
> > see that merging thread is consuming CPU for some time. It may take hours
> > and if I try to return the stress back nothing changes.
> > If this is overloaded merging process, it should take some time to reduce
> > the queue length and it should start accepting new indexing requests.
> > Maybe I am wrong, but I need some help to understand how to check it.
> >
> > AWS - Sorry, I don't have any physical hardware to replicate this test
> > locally
> >
> > GC - I monitored GC closely. If you take a look at CPU utilization
> > screenshot you will see a blue graph that is GC consumption. In addition
> to
> > that I am using Visual GC plugin from VisualVM to understand how GC
> > performs under the stress and don't see any anomalies.
> > There are several GC pauses from time to time but those are not
> > significant. Heap utilization graph tells me that GC is not struggling a
> > lot.
> >
> > Thank you again for your comments, hope the information above will help
> > you understand the problem.
> >
> >
> > Regards,
> > Denis
> >
> >
> > On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson wrote:
> >
> >> Have you changed any of the merge policy parameters? I doubt it but just
> >> asking.
> >>
> >> My guess: your I/O is your bottleneck. There are a limited number of
> >> threads (tunable) that are used for background merging. When they're
> >> all busy, incoming updates are queued up. This squares with your
> >> statement that queries are fine and CPU activity is moderate.
> >>
> >> A quick test there would be to try this on a non-AWS setup if you have
> >> some hardware you can repurpose.
> >>
> >> an 80G heap is a red flag. Most of the time that's too large by far.
> >> So one thing I'd do is hook up some GC monitoring, you may be spending
> 

Re: SolrCloud cluster does not accept new documents for indexing

2018-04-19 Thread Mikhail Khludnev
Threads are hanging on merge io throttling

at 
org.apache.lucene.index.MergePolicy$OneMergeProgress.pauseNanos(MergePolicy.java:150)
at 
org.apache.lucene.index.MergeRateLimiter.maybePause(MergeRateLimiter.java:148)
at 
org.apache.lucene.index.MergeRateLimiter.pause(MergeRateLimiter.java:93)
at 
org.apache.lucene.store.RateLimitedIndexOutput.checkRate(RateLimitedIndexOutput.java:78)

It seems odd. Please confirm that you don't commit on every update request.
The only way to monitor io throttling is to enable infostream and read a
lot of logs.


On Thu, Apr 19, 2018 at 7:59 PM, Denis Demichev  wrote:

> Erick,
>
> Thank you for your quick response.
>
> I/O bottleneck: Please see another screenshot attached, as you can see
> disk r/w operations are pretty low or not significant.
> iostat ===========
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           12.52    0.00    0.00    0.00    0.00   87.48
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           12.51    0.00    0.00    0.00    0.00   87.49
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> ===========
>
> Merging threads: I don't see any modifications of a merging policy
> comparing to the default solrconfig.
> Index config: <ramBufferSizeMB>2000</ramBufferSizeMB> <maxBufferedDocs>50</maxBufferedDocs>
> Update handler: 
> Could you please help me understand how can I validate this theory?
> Another note here. Even if I remove the stress from the cluster I still
> see that merging thread is consuming CPU for some time. It may take hours
> and if I try to return the stress back nothing changes.
> If this is overloaded merging process, it should take some time to reduce
> the queue length and it should start accepting new indexing requests.
> Maybe I am wrong, but I need some help to understand how to check it.
>
> AWS - Sorry, I don't have any physical hardware to replicate this test
> locally
>
> GC - I monitored GC closely. If you take a look at CPU utilization
> screenshot you will see a blue graph that is GC consumption. In addition to
> that I am using Visual GC plugin from VisualVM to understand how GC
> performs under the stress and don't see any anomalies.
> There are several GC pauses from time to time but those are not
> significant. Heap utilization graph tells me that GC is not struggling a
> lot.
>
> Thank you again for your comments, hope the information above will help
> you understand the problem.
>
>
> Regards,
> Denis
>
>
> On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson 
> wrote:
>
>> Have you changed any of the merge policy parameters? I doubt it but just
>> asking.
>>
>> My guess: your I/O is your bottleneck. There are a limited number of
>> threads (tunable) that are used for background merging. When they're
>> all busy, incoming updates are queued up. This squares with your
>> statement that queries are fine and CPU activity is moderate.
>>
>> A quick test there would be to try this on a non-AWS setup if you have
>> some hardware you can repurpose.
>>
>> an 80G heap is a red flag. Most of the time that's too large by far.
>> So one thing I'd do is hook up some GC monitoring, you may be spending
>> a horrible amount of time in GC cycles.
>>
>> Best,
>> Erick
>>
>> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev 
>> wrote:
>> >
>> > All,
>> >
>> > I would like to request some assistance with a situation described
>> below. My
>> > SolrCloud cluster accepts the update requests at a very low pace making
>> it
>> > impossible to index new documents.
>> >
>> > Cluster Setup:
>> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
>> > Cluster - SolrCloud 7.2.1, 10 instances r4.4xlarge, 120GB physical
>> memory,
>> > 80GB Java Heap space, AWS
>> > Java - openjdk version "1.8.0_161" OpenJDK Runtime Environment (build
>> > 1.8.0_161-b14) OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
>> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
>> >
>> > Symptoms:
>> > 1. 4 instances running 4 threads each are using SolrJ client to submit
>> > documents to SolrCloud for indexing, do not perform any manual commits.
>> Each
>> > document  batch is 10 documents big, containing ~200 text fields per
>> > document.
>> > 2. After some 

Re: SolrCloud cluster does not accept new documents for indexing

2018-04-19 Thread Denis Demichev
Erick,

Thank you for your quick response.

I/O bottleneck: Please see another screenshot attached; as you can see, disk
r/w operations are pretty low or insignificant.
iostat ===========
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.52    0.00    0.00    0.00    0.00   87.48

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.51    0.00    0.00    0.00    0.00   87.49

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
===========

Merging threads: I don't see any modifications of the merge policy
compared to the default solrconfig.
Index config: <ramBufferSizeMB>2000</ramBufferSizeMB> <maxBufferedDocs>50</maxBufferedDocs>
Update handler: 
Could you please help me understand how I can validate this theory?
Another note here. Even if I remove the stress from the cluster, I still see
that the merging threads consume CPU for some time. It may take hours, and
if I try to return the stress back, nothing changes.
If this is an overloaded merging process, it should take some time to reduce
the queue length, and then it should start accepting new indexing requests again.
Maybe I am wrong, but I need some help to understand how to check it.

AWS - Sorry, I don't have any physical hardware to replicate this test
locally

GC - I monitored GC closely. If you take a look at the CPU utilization
screenshot you will see a blue graph that is GC consumption. In addition to
that, I am using the Visual GC plugin from VisualVM to understand how GC
performs under the stress, and I don't see any anomalies.
There are several GC pauses from time to time, but those are not
significant. The heap utilization graph tells me that GC is not struggling
much.

Thank you again for your comments, hope the information above will help you
understand the problem.


Regards,
Denis


On Thu, Apr 19, 2018 at 12:31 PM Erick Erickson 
wrote:

> Have you changed any of the merge policy parameters? I doubt it but just
> asking.
>
> My guess: your I/O is your bottleneck. There are a limited number of
> threads (tunable) that are used for background merging. When they're
> all busy, incoming updates are queued up. This squares with your
> statement that queries are fine and CPU activity is moderate.
>
> A quick test there would be to try this on a non-AWS setup if you have
> some hardware you can repurpose.
>
> an 80G heap is a red flag. Most of the time that's too large by far.
> So one thing I'd do is hook up some GC monitoring, you may be spending
> a horrible amount of time in GC cycles.
>
> Best,
> Erick
>
> On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev 
> wrote:
> >
> > All,
> >
> > I would like to request some assistance with a situation described
> below. My
> > SolrCloud cluster accepts the update requests at a very low pace making
> it
> > impossible to index new documents.
> >
> > Cluster Setup:
> > Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
> > Cluster - SolrCloud 7.2.1, 10 instances r4.4xlarge, 120GB physical
> memory,
> > 80GB Java Heap space, AWS
> > Java - openjdk version "1.8.0_161" OpenJDK Runtime Environment (build
> > 1.8.0_161-b14) OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
> > Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
> >
> > Symptoms:
> > 1. 4 instances running 4 threads each are using SolrJ client to submit
> > documents to SolrCloud for indexing, do not perform any manual commits.
> Each
> > document  batch is 10 documents big, containing ~200 text fields per
> > document.
> > 2. After some time (~20-30 minutes, by that time I see only ~50-60K of
> > documents in a collection, node restarts do not help) I notice that
> clients
> > cannot submit new documents to the cluster for indexing anymore, each
> > operation takes enormous amount of time
> > 3. Cluster is not loaded at all, CPU consumption is moderate (I am seeing
> > that merging is performed all the time though), memory consumption is
> > adequate, but still updates are not accepted from external clients
> > 4. Search requests are handled fine
> > 5. I don't see any significant activity in SolrCloud logs anywhere, just
> > regular replication attempts only. No errors.
> >
> >
> > Additional information
> > 1. Please see Thread Dump attached.
> > 2. Please see SolrAdmin info with physical memory and file descriptor
> > utilization
> > 3. Please see 

Re: SolrCloud cluster does not accept new documents for indexing

2018-04-19 Thread Erick Erickson
Have you changed any of the merge policy parameters? I doubt it but just asking.

My guess: your I/O is your bottleneck. There are a limited number of
threads (tunable) that are used for background merging. When they're
all busy, incoming updates are queued up. This squares with your
statement that queries are fine and CPU activity is moderate.

A quick test there would be to try this on a non-AWS setup if you have
some hardware you can repurpose.

An 80G heap is a red flag. Most of the time that's too large by far.
So one thing I'd do is hook up some GC monitoring; you may be spending
a horrible amount of time in GC cycles.

Best,
Erick

On Thu, Apr 19, 2018 at 8:23 AM, Denis Demichev  wrote:
>
> All,
>
> I would like to request some assistance with a situation described below. My
> SolrCloud cluster accepts the update requests at a very low pace making it
> impossible to index new documents.
>
> Cluster Setup:
> Clients - 4 JVMs, 4 threads each, using SolrJ to submit data
> Cluster - SolrCloud 7.2.1, 10 instances r4.4xlarge, 120GB physical memory,
> 80GB Java Heap space, AWS
> Java - openjdk version "1.8.0_161" OpenJDK Runtime Environment (build
> 1.8.0_161-b14) OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
> Zookeeper - 3 standalone nodes on t2.large running under Exhibitor
>
> Symptoms:
> 1. 4 instances running 4 threads each are using SolrJ client to submit
> documents to SolrCloud for indexing, do not perform any manual commits. Each
> document  batch is 10 documents big, containing ~200 text fields per
> document.
> 2. After some time (~20-30 minutes; by that time I see only ~50-60K
> documents in the collection, and node restarts do not help) I notice that clients
> cannot submit new documents to the cluster for indexing anymore; each
> operation takes an enormous amount of time
> 3. Cluster is not loaded at all, CPU consumption is moderate (I am seeing
> that merging is performed all the time though), memory consumption is
> adequate, but still updates are not accepted from external clients
> 4. Search requests are handled fine
> 5. I don't see any significant activity in SolrCloud logs anywhere, just
> regular replication attempts only. No errors.
>
>
> Additional information
> 1. Please see Thread Dump attached.
> 2. Please see SolrAdmin info with physical memory and file descriptor
> utilization
> 3. Please see VisualVM screenshots with CPU and memory utilization and CPU
> profiling data. Physical memory utilization is about 60-70 percent all the
> time.
> 4. Schema file contains ~10 permanent fields 5 of which are mapped and
> mandatory and persisted, the rest of the fields are optional and dynamic
> 5. Solr config configures autoCommit to be set to 2 minutes and openSearcher
> set to false
> 6. Caches are set up with autoWarmCount = 0
> 7. GC was fine tuned and I don't see any significant CPU utilization by GC
> or any lengthy pauses. Majority of the garbage is collected in young gen
> space.
>
> My primary question: I see that the cluster is alive and performs some
> merging and commits but does not accept new documents for indexing. What is
> causing this slowdown and why it does not accept new submissions?
>
>
> Regards,
> Denis