Re: Solr Cloud Autoscaling Basics

2021-02-24 Thread yasoobhaider
Any pointers here would be appreciated :)



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr Cloud Autoscaling Basics

2021-02-23 Thread yasoobhaider
Hi 

I currently have a master-slave setup. I'm evaluating SolrCloud.

I've read through most of the documentation, but what I can't seem to find
is the preferred way to autoscale the cluster. 

In the master-slave architecture, we have a CPU-based autoscaling policy configured on Spotinst: it adds a node, which replicates the index from the master and starts serving traffic. This lets us bring the cost of the cluster down by downscaling at low-traffic times.

What would be the right way to do this in Cloud mode? Would the replication factor need to be changed again and again? Would I have to call the ADDREPLICA API after adding a node, as suggested here?
https://lucene.472066.n3.nabble.com/Autoscaling-in-8-2-td4451650.html#a4451659.
It seems rather cumbersome compared to the master-slave architecture. (A sketch of what I imagine the scale-up/scale-down hooks would have to do is at the end of this mail.)

How is SolrCloud usually run in production environments? With autoscaling turned off completely (apart from failure scenarios where a node crashes, for example)?
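
If it helps frame the question, this is roughly what I imagine a scale-up/scale-down hook would have to do via SolrJ. This is only a sketch; the collection, shard, ZooKeeper and node names are made up.

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class ScaleHooks {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("zk1:2181"), Optional.empty()).build()) {

      // Scale up: once the new Solr node has joined the cluster,
      // place an extra replica of shard1 on it.
      CollectionAdminRequest.AddReplica add =
          CollectionAdminRequest.addReplicaToShard("mycollection", "shard1");
      add.setNode("new-node:8983_solr");
      add.process(client);

      // Scale down: drop that replica again before terminating the node.
      CollectionAdminRequest.deleteReplica("mycollection", "shard1", "core_node12")
          .process(client);
    }
  }
}

Is wiring something like this into the autoscaling hooks what people actually do, or is there a more idiomatic way?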



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Extremely Small Segments

2021-02-12 Thread yasoobhaider
Hi

I am migrating from master-slave to SolrCloud, but I'm running into problems
with indexing.

Cluster details:

8 machines with 64GB of memory each, each hosting 1 replica.
4 shards, 2 replicas of each. Heap size is 16GB.

Collection details:

Total number of docs: ~250k (but only 50k are indexed right now)
Size of collection (master slave number for reference): ~10GB

Our collection is fairly heavy, with some dynamic fields of high cardinality (on the order of thousands), which is why we use such a large heap even for a small collection.

Relevant solrconfig settings:

commit settings:

<autoCommit>
  <maxDocs>1</maxDocs>
  <maxTime>360</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:180}</maxTime>
</autoSoftCommit>

index config:

<ramBufferSizeMB>500</ramBufferSizeMB>
<maxBufferedDocs>1</maxBufferedDocs>

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicyFactory>

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">6</int>
  <int name="maxThreadCount">4</int>
</mergeScheduler>

Problem:

I set up the cluster and started indexing at the throughput of our earlier master-slave setup, but the machines soon ran into full-blown garbage collection. This throughput is not high: we index the whole collection overnight, roughly ~250k documents in 6 hours, which is about 12 requests per second.

So now I'm indexing at an extremely slow rate to try to isolate the problem.

Currently I'm indexing 1 document every 2 seconds, i.e. ~30 documents per minute.

Observations:

1. I'm noticing extremely small segments in the segments UI. Example:

Segment _1h4:
#docs: 5
#dels: 0
size: 1,586,878 bytes
age: 2021-02-12T11:05:33.050Z
source: flush

Why is Lucene creating such small segments? My understanding is that segments are created when the ramBufferSizeMB or maxBufferedDocs limit is hit, or on a hard commit. None of those should lead to such small segments. (A stripped-down sketch of the indexing client I'm testing with is included after the observations.)

2. The index/ directory has a large number of files. For one shard with 30k documents and 1.5GB of index, there are ~450-550 files in this directory. I understand that each segment is composed of several files, but even accounting for that, the number of segments seems very large.

Note: Nothing out of the ordinary in logs. Only /update request logs.

I'd appreciate any help making sense of the two observations above.
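
For reference, here is a stripped-down sketch of the kind of indexing client I'm testing with: it batches adds and never commits from the client side, so flushes should only come from the RAM buffer, maxBufferedDocs, or the server-side autoCommit. Collection, ZooKeeper and field names are made up.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class Indexer {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("zk1:2181"), Optional.empty()).build()) {

      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 500; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title_t", "example title " + i);
        batch.add(doc);
        if (batch.size() == 100) {          // send adds in batches of 100
          client.add("mycollection", batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add("mycollection", batch);
      }
      // Note: no client.commit() and no commitWithin anywhere;
      // visibility is left entirely to autoSoftCommit on the server.
    }
  }
}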



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: CMS GC - Old Generation collection never finishes (due to GC Allocation Failure?)

2018-10-14 Thread yasoobhaider
After none of the JVM configuration options helped with GC, I took a heap dump of one of the misbehaving slaves as Erick suggested, and the analysis shows that the field cache is using most of the total heap.

Memory Analyzer output:

One instance of "org.apache.solr.uninverting.FieldCacheImpl" loaded by
"org.eclipse.jetty.webapp.WebAppClassLoader @ 0x7f60f7b38658" occupies
61,234,712,560 (91.86%) bytes. The memory is accumulated in one instance of
"java.util.HashMap$Node[]" loaded by "<system class loader>".

Hypothesis:

Without regular indexing, commits are not happening, so the searcher is not being reopened and the field cache is not being reset. Since there is only one instance of this field cache, it is a live object and is not cleaned up by GC.

But I also noticed that the field cache shown in the Solr UI has the same entries for all collections on that Solr instance.

Ques 1. Is the field cache reset on commit? If so, is it reset when any of the collections is committed? Or is it not reset at all and I'm missing something here?
Ques 2. Is there a way to reset/delete this cache every x minutes (the current autocommit interval), irrespective of whether documents were added or not? (A sketch of what I have in mind is below.)
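
To be concrete about Ques 2, this is the kind of thing I have in mind: issue a hard commit on a fixed schedule from a small client, whether or not documents were added. This is only a sketch (the core URL and interval are made up), and I don't know yet whether it would actually clear the field cache; that is exactly what I'm asking.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PeriodicCommit {
  public static void main(String[] args) {
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/c1").build();

    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      try {
        // hard commit: waitFlush=true, waitSearcher=true, softCommit=false
        client.commit(true, true, false);
      } catch (Exception e) {
        e.printStackTrace();
      }
    }, 15, 15, TimeUnit.MINUTES);
  }
}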

Other than this, I think the reason for the huge heap usage (as others have pointed out) is that we are not using docValues for any of the fields, and we use a large number of fields in sort functions (between 15 and 20 across all queries combined). As the next step on this front, I will add new fields with docValues enabled and reindex the entire collection. Hopefully that will help.

We use quite a few dynamic fields in sorting. There is no mention of using docValues with dynamic fields in the official documentation
(https://lucene.apache.org/solr/guide/6_6/docvalues.html).

Ques 3. Do docValues work with dynamic fields? If they do, is there anything in particular I should look out for, like the cardinality of the field (i.e. the number of different x's in example_dynamic_field_x)?
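
For what it's worth, this is the kind of dynamic field definition I have in mind for the re-index, added via the Schema API (this assumes a managed schema; with a hand-edited schema.xml it would just be the equivalent dynamicField line). The field name pattern and field type name are made up and would need to match our schema.

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class AddSortField {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/c1").build()) {
      Map<String, Object> attrs = new LinkedHashMap<>();
      attrs.put("name", "*_sort_d");   // hypothetical dynamic field pattern
      attrs.put("type", "tdouble");    // numeric type we sort on
      attrs.put("indexed", false);
      attrs.put("stored", false);
      attrs.put("docValues", true);    // the point: keep sorting off the field cache
      new SchemaRequest.AddDynamicField(attrs).process(client);
    }
  }
}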

Shawn, I've uploaded my configuration files for the two collections here:
https://ufile.io/u6oe0 (tar -zxvf c1a_confs.tar.gz to decompress)

c1 collection is ~10GB when optimized, and has 2.5 million documents.
ca collection is ~2GB when optimized, and has 9.5 million documents.

Please let me know if you think there is something amiss in the
configuration that I should fix.

Thanks
Yasoob



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: CMS GC - Old Generation collection never finishes (due to GC Allocation Failure?)

2018-10-11 Thread yasoobhaider
Hi Shawn, thanks for the inputs.

I have uploaded the gc logs of one of the slaves here:
https://ufile.io/ecvag (should work till 18th Oct '18)

I uploaded the logs to gceasy as well, and it says the problem is consecutive full GCs. The solution it suggests is to increase the heap size, but I am already running a pretty big heap, so I don't think increasing it further is a long-term solution.

From what I understood after looking around a bit more, this is a Concurrent Mode Failure for CMS. I found an old blog post mentioning -XX:CMSFullGCsBeforeCompaction=1 to make sure compaction is done before the next collection is triggered. So if this is a fragmentation problem, I hope that will solve it.

I will also try using docValues, as Ere suggested, on a couple of fields that we facet on heavily, to reduce memory usage on the slaves.

Please share any ideas you may have from the GC log analysis.

Thanks
Yasoob



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


CMS GC - Old Generation collection never finishes (due to GC Allocation Failure?)

2018-10-03 Thread yasoobhaider
Hi

I'm working with a Solr cluster with master-slave architecture.

Master and slave config:
RAM: 120GB
cores: 16

At any point there are between 10 and 20 slaves in the cluster, each serving ~2k requests per minute. Each slave hosts two collections of approx 10G (~2.5 million docs) and 2G (~10 million docs) when optimized.

I am working with Solr 6.2.1

Solr JVM configuration:

-XX:+CMSParallelRemarkEnabled
-XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:-OmitStackTraceInFastThrow
-XX:CMSInitiatingOccupancyFraction=50
-XX:CMSMaxAbortablePrecleanTime=6000
-XX:ConcGCThreads=4
-XX:MaxTenuringThreshold=8
-XX:ParallelGCThreads=4
-XX:PretenureSizeThreshold=64m
-XX:SurvivorRatio=15
-XX:TargetSurvivorRatio=90
-Xmn10G
-Xms80G
-Xmx80G

Some of these settings were arrived at through trial and error over time, including the huge heap size.

This cluster usually runs without any error.

In the usual scenario, old gen GC is triggered, per the configuration, at 50% old gen occupancy (with Xmx=80G and Xmn=10G, the old gen is ~70G, so collection kicks in around 35G), and the collector clears out the memory over the next minute or so. This happens every 10-15 minutes.

However, I have noticed that sometimes the GC pattern of the slaves changes completely and old gen GC is not able to clear the memory.

After watching the GC logs closely across multiple old gen collections, I noticed that old gen GC is triggered at 50% occupancy, but if there is a GC Allocation Failure before the collection completes (after CMS Initial Remark but before CMS reset), the old gen collection is not able to clear much memory. And as soon as this collection completes, another old gen GC is triggered.

In the worst case, this cycle of old gen GCs and GC allocation failures keeps repeating, the old gen usage keeps growing, and it ends in a single-threaded stop-the-world GC that cannot reclaim much, at which point I have to restart the Solr server.

The last time, this happened after the following sequence of events:

1. We optimized the bigger collection, bringing it down to its optimized size of ~10G.
2. For an unrelated reason, we had stopped indexing to the master. We usually index at a low-ish throughput of ~1 million docs/day. This is relevant because while we are indexing, the size of the collection grows, which affects the heap used by the collection.
3. The slaves started behaving erratically, with old gen collections unable to free up the required memory, eventually getting stuck in a STW GC.

As unlikely as this sounds, this is the only thing that changed on the
cluster. There was no change in query throughput or type of queries.

I restarted the slaves multiple times, but the GC behaved the same way for over three days. Then, when we fixed the indexing and made it live again, the slaves resumed their original GC pattern and have been running without any issues for over 24 hours now.

I would really be grateful for any advice on the following:

1. What could be the reason behind CMS not being able to free up the memory? What experiments can I run to narrow this down?
2. Can stopping/starting indexing be a reason for such drastic changes to the GC pattern?
3. I have read in multiple places on this mailing list that the heap should be much smaller (2x-3x the size of the collection), but the last time I tried that, CMS was not able to keep up and STW GCs occurred that were only resolved by a restart. My reasoning is that the type of queries and the query throughput are also factors in sizing the heap, so it may simply be that our queries create too many objects. Is that reasoning correct, or should I try a lower heap size again (if it helps achieve a stable GC pattern)?

(4. Silly question, but what is the right way to ask a question on this mailing list: via mail or via the nabble website? I sent this question earlier as an email, but it was not showing up on the nabble website, so I am posting it from the website now.)

-
-

Logs which show this:


Desired survivor size 568413384 bytes, new threshold 2 (max 8)
- age   1:  437184344 bytes,  437184344 total
- age   2:  194385736 bytes,  631570080 total
: 9868992K->616768K(9868992K), 1.7115191 secs]
48349347K->40160469K(83269312K), 1.7116410 secs] [Times: user=6.25 sys=0.00,
real=1.71 secs]
Heap after GC invocations=921 (full 170):
 par new generation   total 9868992K, used 616768K [0x7f4f8400,
0x7f520400, 0x7f520400)
  eden space 9252224K,   0% used [0x7f4f8400, 0x7f4f8400,
0x7f51b8b6)
  from space 616768K, 100% used [0x7f51de5b, 0x7f520400,
0x7f520400)
  to   space 616768K,   0% used [0x7f51b8b6, 0x7f51b8b6,
0x7f51de5

Replication on startup takes a long time

2017-09-23 Thread yasoobhaider
Hi

We have set up a master-slave architecture for our Solr instance.

Number of docs: 2 million
Collection size: ~12GB when optimized
Heap size: 40G
Machine specs: 60GB RAM, 8 cores

We are using Solr 6.2.1.

Autocommit Configuration:

<autoCommit>
  <maxDocs>40000</maxDocs>
  <maxTime>90</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:360}</maxTime>
</autoSoftCommit>

I have set maxDocs to 40k because we do heavy weekly indexing, and I didn't want a lot of commits happening too quickly.

Indexing runs smoothly on master. But when I add a new slave pointing to the
master, it takes about 20 minutes for the slave to become queryable.

There are two parts to this latency. First, it takes approximately 13 minutes for the slave's generation to catch up with the master's. Then it takes another 7 minutes for the instance to become queryable (it returns 0 hits during those 7 minutes).

I checked the logs, and the collection is downloaded within two minutes. After that, there is nothing in the logs for the next few minutes, even with LoggingInfoStream set to 'ALL'.

Question 1. What happens after all the files have been downloaded on the slave from the master? What is Solr doing internally that makes the generation sync-up with the master take so long? Whatever it is doing, should it take that long (~5 minutes)?

After the generation sync-up happens, it takes another 7 minutes to start returning results. I set the autowarm count of all caches to 0, which brought this down to 3 minutes.

Question 2. What is happening during these 3 minutes? Can this also be optimized?

I also wanted to ask an unrelated question about when a slave becomes searchable. I understand that documents on the master become searchable when a hard commit happens with openSearcher set to true, or when a soft commit happens. But when do documents become searchable on a slave?

Question 3a. When do documents become searchable on a slave? As soon as a segment is copied over from the master? Does a soft commit make any sense on a slave, given we are not indexing anything there? Does autoCommit with openSearcher=true affect the slave in any way?

Question 3b. Does a soft commit on the master affect the slave in any way? (I only have the commit and startup options in replicateAfter in my solrconfig.)

Would appreciate any help.

PS: One of my colleagues said the latency may be because our schema.xml is huge (~500 fields). Question 4. Could that be a reason?

Thanks
Yasoob Haider



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: CommitScheduler Thread blocked due to excessive number of Merging Threads

2017-09-09 Thread yasoobhaider
Hi Shawn

Thanks for putting the settings in context. This definitely helps.

Before applying these settings, I did a bit more digging to really understand why merging was so slow. Looking at thread dumps from Solr 6.6 and Solr 5.4, I found that within the merge process, merging the postings takes much longer on 6.6 than on 5.4.

In a merge of 500 docs, it took on average 100 msec on 5.4 vs 3500 msec on 6.6, roughly 35x slower.

I compared the source code of the two versions and found that different merge methods are used for the postings. In 5.4, the default merge method of the FieldsConsumer class is used, while in 6.6 PerFieldPostingsFormat's merge method is used. It looks like this change went into Solr 6.3. So I replaced the 6.6 instance with 6.2.1 and re-indexed all the data, and it is working very well, even with the settings I had initially used.

This is the issue that prompted the change:
https://issues.apache.org/jira/browse/LUCENE-7456

I plan to experiment with the settings you provided and see if they help our case further. But out of curiosity, I would like to understand what difference between the two merge implementations has such a drastic effect on merging speed.

Yasoob



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr Commit Thread Blocked because of excessive number of merging threads

2017-09-07 Thread yasoobhaider
Hi

My team has tasked me with upgrading Solr from the version we are using (5.4) to the latest stable version, 6.6. I have been stuck on the indexing part for a few days now.

First I'll list the requirements, then all the configuration settings I have
tried.

In total I'm indexing about 2.5 million documents. The average document size is ~5KB. I have 10 PHP workers running in parallel, hitting Solr with ~1K docs/minute (this sometimes goes up to ~3K docs/minute).

System specifications:
RAM: 120G
Processors: 16

Solr configuration:
Heap size: 80G


solrconfig.xml: (Relevant parts; please let me know if there's anything else
you would like to look at)


<autoCommit>
  <maxDocs>1</maxDocs>
  <maxTime>380</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

<ramBufferSizeMB>5000</ramBufferSizeMB>
<maxBufferedDocs>1</maxBufferedDocs>

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">30</int>
  <int name="segmentsPerTier">30</int>
</mergePolicyFactory>

<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">8</int>
  <int name="maxThreadCount">7</int>
</mergeScheduler>


The main problem:

When I start indexing, everything is good until I reach about 2 million docs, which takes ~10 hours. But then the commitScheduler thread gets blocked. It is stuck in doStall() in ConcurrentMergeScheduler (CMS). Looking at the InfoStream logs, I found a "too many merges; stalling" message from the commitScheduler thread, after which it gets stuck in the while loop forever.
Here's the check that stalls our commitScheduler thread (from ConcurrentMergeScheduler):

while (writer.hasPendingMerges() && mergeThreadCount() >= maxMergeCount) {
  ...
  if (verbose() && startStallTime == 0) {
    message("too many merges; stalling...");
  }
  startStallTime = System.currentTimeMillis();
  doStall();
}

This is why I have set maxMergeCount and maxThreadCount explicitly in my solrconfig. I thought increasing the number of threads would ensure there is always a spare slot for a commit to go through. But now that I have increased the allowed number of threads, Lucene just spawns that many "Lucene Merge Thread"s and leaves none spare for when a commit comes along and triggers a merge. And then it gets stuck forever.
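
For context on my mental model: my understanding is that the solrconfig mergeScheduler settings map onto Lucene's ConcurrentMergeScheduler caps, roughly like the sketch below (not our actual code; the values are copied from my solrconfig above).

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

public class MergeSchedulerSketch {
  public static void main(String[] args) {
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    // Same caps as in my solrconfig: at most 8 queued/running merges,
    // at most 7 merge threads actually running.
    cms.setMaxMergesAndThreads(8, 7);

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergeScheduler(cms);
    // An incoming merge (e.g. one triggered by a commit) stalls whenever
    // there are pending merges and mergeThreadCount() >= maxMergeCount,
    // which is exactly the doStall() loop quoted above.
  }
}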

Well, not really forever: I'm guessing that once one of the merge threads is removed (via removeMergeThread() in CMS) the commit will go through, but the merging is so slow that this doesn't happen (I gave it a couple of hours, and the commit thread was still stuck). Which brings us to the second problem.



The second problem:
Merging is extremely slow, and I'm not sure what I'm missing here. Maybe there is a change in the 6.x versions that has significantly hampered merging speed. From the thread dump, I can see that the "Lucene Merge Thread"s are in the RUNNABLE state, inside a TreeMap.getEntry() call. Is this normal?

Another thing I noticed is that merge disk IO is throttled at ~20 MB/s (see the infoStream output below). I'm not sure whether that alone can hamper merging this much.

My index size was ~10GB; I left it overnight (~6 hours) and almost no merging happened.

Here's another infoStream message from the logs, in case it helps.

-

2017-09-06 14:11:07.921 INFO  (qtp834133664-115) [   x:collection1]
o.a.s.u.LoggingInfoStream [MS][qtp834133664-115]: updateMergeThreads
ioThrottle=true targetMBPerSec=23.6 MB/sec
merge thread Lucene Merge Thread #4 estSize=5116.1 MB (written=4198.1 MB)
runTime=8100.1s (stopped=0.0s, paused=142.5s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #7 estSize=1414.3 MB (written=0.0 MB)
runTime=0.0s (stopped=0.0s, paused=0.0s) rate=23.6 MB/sec
  leave running at 23.6 MB/sec
merge thread Lucene Merge Thread #5 estSize=1014.4 MB (written=427.2 MB)
runTime=6341.9s (stopped=0.0s, paused=12.3s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #3 estSize=752.8 MB (written=362.8 MB)
runTime=8100.1s (stopped=0.0s, paused=12.4s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #2 estSize=312.5 MB (written=151.9 MB)
runTime=8100.7s (stopped=0.0s, paused=8.7s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #6 estSize=87.7 MB (written=63.0 MB)
runTime=3627.8s (stopped=0.0s, paused=0.9s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #1 estSize=57.3 MB (written=21.7 MB)
runTime=8101.2s (stopped=0.0s, paused=0.2s) rate=19.7 MB/sec
  now change from 19.7 MB/sec to 23.6 MB/sec
merge thread Lucene Merge Thread #0 estSize=4.6 MB (written=0.0 MB)
runTime=8101.0s (stopped=0.0s, paused=0.0s) rate=unlimited
  leave running at Infinity MB/sec

-

I also increased maxMergeAtOnce and segmentsPerTier from 10 to 20 and then to 30, in hopes of having fewer merge threads running at once, but that

Re: CommitScheduler Thread blocked due to excessive number of Merging Threads

2017-09-07 Thread yasoobhaider
So I did a little more digging into why the merging is taking so long, and it looks like merging the postings is the culprit.

On 5.4, merging 500 docs takes approximately 100 msec, while on 6.6 it takes more than 3000 msec. The difference seems to get worse as more docs are merged.

Any ideas why this may be the case?

Yasoob



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html