Thanks Erick and Emir -- we are going to start with <1> and possibly <2>.
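
For reference, a minimal sketch of what <1> could look like in solrconfig.xml, assuming the client cannot be changed to stop sending commits. The chain name and the two maxTime values are illustrative assumptions, not settings taken from this thread; the intervals should be as long as the application can tolerate.

```xml
<!-- Sketch only: chain name and interval values are illustrative assumptions. -->
<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <!-- Ignore explicit commit/optimize requests from clients and reply with HTTP 200 -->
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- The two elements below belong inside the existing <updateHandler> element. -->
<!-- Hard commit: flushes to disk but does not open a new searcher. -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<!-- Soft commit: controls visibility of new documents; assumed 5 minutes here. -->
<autoSoftCommit>
  <maxTime>300000</maxTime>
</autoSoftCommit>
```

With explicit commits ignored, new searchers are opened only by the soft commit interval, which should stop the onDeckSearchers from overlapping.
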
On Thu, Oct 26, 2017 at 7:06 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:

> Hi Fengtan,
> I would just add that when merging collections, you might want to use
> document routing
> (https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html#ShardsandIndexingDatainSolrCloud-DocumentRouting)
> - since you are keeping separate collections, I guess you have a
> “collection ID” to use as routing key. This will enable you to have one
> collection but query only the shard(s) with data from one “collection”.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
> > On 25 Oct 2017, at 19:25, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > <1> It's not that explicit commits are expensive, it's that they happen
> > too fast. An explicit commit and an internal autocommit have exactly
> > the same cost. Your "overlapping ondeck searchers" is definitely an
> > indication that your commits are happening from somewhere too quickly
> > and are piling up.
> >
> > <2> Likely a good thing, each collection increases overhead. And
> > 1,000,000 documents is quite small in Solr's terms unless the
> > individual documents are enormous. I'd do this for a number of
> > reasons.
> >
> > <3> Certainly an option, but I'd put that last. Fix the commit problem
> > first ;)
> >
> > <4> If you do this, make the autowarm count quite small. That said,
> > this will be of very little use if you have frequent commits. Let's say
> > you commit every second. The autowarming will warm caches, which will
> > then be thrown out a second later. It will also increase the time it
> > takes to open a new searcher.
> >
> > <5> Yeah, this would probably just be a band-aid.
> >
> > If I were prioritizing these, I'd do
> > <1> first. If you control the client, just don't call commit. If you
> > do not control the client, then what you've outlined is fine. Tip: set
> > your soft commit settings to be as long as you can stand. If you must
> > have very short intervals, consider disabling your caches completely.
> > Here's a long article on commits....
> > https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > <2> Actually, this and <1> are pretty close in priority.
> >
> > Then re-evaluate. Fixing the commit issue may buy you quite a bit of
> > time. Having 1,000 collections is pushing the boundaries presently.
> > Each collection will establish watchers on the bits it cares about in
> > ZooKeeper, and reducing the watchers by a factor approaching 1,000 is
> > A Good Thing.
> >
> > Frankly, between these two things I'd pretty much expect your problems
> > to disappear. Wouldn't be the first time I've been totally wrong, but
> > it's where I'd start ;)
> >
> > Best,
> > Erick
> >
> > On Wed, Oct 25, 2017 at 8:54 AM, Fengtan <fengtan...@gmail.com> wrote:
> >> Hi,
> >>
> >> We run a SolrCloud 6.4.2 cluster with ZooKeeper 3.4.6 on 3 VMs.
> >> Each VM runs RHEL 7 with 16 GB RAM, 8 CPUs and OpenJDK 1.8.0_131; each
> >> VM has one Solr and one ZK instance.
> >> The cluster hosts 1,000 collections; each collection has 1 shard and
> >> between 500 and 50,000 documents.
> >> Documents are indexed incrementally every day; the Solr client mostly
> >> does searching.
> >> Solr runs with -Xms7g -Xmx7g.
> >>
> >> Everything has been working fine for about one month, but a few days ago
> >> we started to see Solr timeouts: https://pastebin.com/raw/E2prSrQm
> >>
> >> Also we have always seen these:
> >> PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> >>
> >>
> >> We are not sure what is causing the timeouts, although we have identified
> >> a few things that could be improved:
> >>
> >> 1) Ignore explicit commits using IgnoreCommitOptimizeUpdateProcessorFactory
> >> -- we are aware that explicit commits are expensive
> >>
> >> 2) Drop the 1,000 collections and use a single one instead (all our
> >> collections use the same schema/solrconfig.xml) since stability problems
> >> are expected when the number of collections reaches the low hundreds
> >> <https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud>. The
> >> downside is that the new collection would contain 1,000,000 documents,
> >> which may bring new challenges.
> >>
> >> 3) Tune the GC and possibly switch from CMS to G1, as it seems to bring
> >> better performance according to this
> >> <https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems>,
> >> this
> >> <https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector>
> >> and this
> >> <http://lucene.472066.n3.nabble.com/java-util-concurrent-TimeoutException-Idle-timeout-expired-50001-50000-ms-td4321209.html>.
> >> The downside is that Lucene explicitly discourages the usage of G1
> >> <https://wiki.apache.org/lucene-java/JavaBugs#Java_Bugs_in_various_JVMs_affecting_Lucene_.2F_Solr>
> >> so we are not sure what to expect.
> >> We use the default GC settings:
> >> -XX:NewRatio=3
> >> -XX:SurvivorRatio=4
> >> -XX:TargetSurvivorRatio=90
> >> -XX:MaxTenuringThreshold=8
> >> -XX:+UseConcMarkSweepGC
> >> -XX:+UseParNewGC
> >> -XX:ConcGCThreads=4
> >> -XX:ParallelGCThreads=4
> >> -XX:+CMSScavengeBeforeRemark
> >> -XX:PretenureSizeThreshold=64m
> >> -XX:+UseCMSInitiatingOccupancyOnly
> >> -XX:CMSInitiatingOccupancyFraction=50
> >> -XX:CMSMaxAbortablePrecleanTime=6000
> >> -XX:+CMSParallelRemarkEnabled
> >> -XX:+ParallelRefProcEnabled
> >>
> >> 4) Tune the caches, possibly by increasing autowarmCount on filterCache
> >> -- our current config is:
> >> <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
> >> <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
> >> <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
> >>
> >> 5) Tweak the timeout settings, although this would not fix the underlying
> >> issue
> >>
> >>
> >> Do any of these options seem relevant? Is there anything else that might
> >> address the timeouts?
> >>
> >> Thanks