Wow, this is not expected at all. On the face of it, there's no way you should be getting overlapping on-deck searchers.
I recommend you put your maxWarmingSearchers back to 2. That setting is a fail-safe that is there to make people look at why they're warming a bunch of searchers at once. With your settings, it's saying that autowarming is taking over 10 minutes. That is absurdly long, so either something is pathologically wrong with your Solr or you're really committing more often than you think. Possibly you have a client issuing commits?

You can see commits in your Solr logs; just look for the word "commit". Each of those lines will say whether the commit has openSearcher true or false. Are the timestamps where openSearcher=true really 10 minutes apart? You'll also see autowarm times in your logs; check how long they really take. If they really take 10 minutes, we need to get to the bottom of that, because the autowarm counts in the cache configurations you're showing don't indicate any problem here.

Bottom line:

1> You shouldn't be seeing nodes go into recovery in the first place. Are your Solr logs showing any ERROR-level messages?

2> It's extremely surprising that you're getting any overlapping on-deck searchers. If it turns out that your autowarming really is taking more than a few seconds, getting a stack trace to see where Solr is spending all the time is warranted.

3> Any clues from the logs as to _why_ the nodes are going into recovery? Also look at your leader's log file and see if there are any messages about "leader initiated recovery". If you see that, then perhaps one of the timeouts is too short.

4> The tlog sizes are quite reasonable. Tlogs are only relevant when a node goes down for some reason anyway, so I wouldn't expend too much energy worrying about them until we get to the bottom of the overlapping searchers and the nodes going into recovery.

BTW, nice job of laying out the relevant issues and adding supporting information! I wish more problem statements were as complete.

If your Solr is 4.7.0, there was a memory problem and you should definitely go to 4.7.2.
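To check the commit cadence, a quick script along these lines can pull the openSearcher=true commits out of the log and print the gaps between them. The timestamp format and the exact wording of the commit lines here are assumptions; adjust the regex to match what your Solr version actually writes to solr.log:

```python
import re
from datetime import datetime

# Assumed log-line shape -- verify against your own solr.log before
# trusting the regex; different Solr versions/log configs vary.
COMMIT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*commit.*openSearcher=(true|false)"
)

def opensearcher_gaps(lines):
    """Return gaps, in seconds, between successive openSearcher=true commits."""
    times = []
    for line in lines:
        m = COMMIT_RE.match(line)
        if m and m.group(2) == "true":
            times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# Fabricated sample: searcher-opening commits 10 minutes apart,
# which is what the autoSoftCommit setting below should produce.
sample = [
    "2015-09-23 08:00:00 INFO  start commit{...openSearcher=false...}",
    "2015-09-23 08:10:00 INFO  start commit{...openSearcher=true...}",
    "2015-09-23 08:20:00 INFO  start commit{...openSearcher=true...}",
]
print(opensearcher_gaps(sample))
```

If the printed gaps are much shorter than your autoSoftCommit interval, something else (most likely a client) is issuing commits.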
The symptom here is that you'll see Out of Memory errors...

Best,
Erick

On Wed, Sep 23, 2015 at 8:48 AM, Lorenzo Fundaró <lorenzo.fund...@dawandamail.com> wrote:
> Hi!
>
> I keep getting nodes that fall into recovery mode and then issue the
> following WARN log every 10 seconds:
>
> WARN Stopping recovery for core=xxxx coreNodeName=core_node7
>
> and sometimes this appears as well:
>
> PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>
> At higher-traffic times this gets worse, and out of 4 nodes only 1 is up.
>
> I have 4 Solr nodes, each running two cores, A and B, of 13GB and 1.5GB
> respectively. Core A gets a lot of index updates and higher query traffic
> compared to core B. Core A is going through active/recovery/down states
> very often.
>
> The nodes are coordinated via ZooKeeper; we have three, running on
> different machines than Solr.
>
> Each machine has around 24 cores and between 38 and 48 GB of RAM, with
> each Solr getting 16GB of heap memory.
>
> I read this article:
>
> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> and decided to apply:
>
> <autoCommit>
>   <!-- Every 15 seconds -->
>   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> and
>
> <autoSoftCommit>
>   <!-- Every 10 minutes -->
>   <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
> </autoSoftCommit>
>
> I also have these cache configurations:
>
> <filterCache class="solr.LFUCache"
>              size="64"
>              initialSize="64"
>              autowarmCount="32"/>
>
> <queryResultCache class="solr.LRUCache"
>                   size="512"
>                   initialSize="512"
>                   autowarmCount="0"/>
>
> <documentCache class="solr.LRUCache"
>                size="1024"
>                initialSize="1024"
>                autowarmCount="0"/>
>
> <cache name="perSegFilter"
>        class="solr.search.LRUCache"
>        size="10"
>        initialSize="0"
>        autowarmCount="10"
>        regenerator="solr.NoOpRegenerator" />
>
> <fieldValueCache class="solr.FastLRUCache"
>                  size="512"
>                  autowarmCount="0"
>                  showItems="32" />
>
> I also have this:
>
> <maxWarmingSearchers>6</maxWarmingSearchers>
>
> The size of the tlogs is usually between 1MB and 8MB.
>
> I thought the changes above could improve the situation, but I am not
> 100% convinced they did, since after 15 min one of the nodes entered
> recovery mode again.
>
> Any ideas?
>
> Thanks in advance.
>
> Cheers!
>
> --
> Lorenzo Fundaro
> Backend Engineer
> E-Mail: lorenzo.fund...@dawandamail.com
>
> Fax + 49 - (0)30 - 25 76 08 52
> Tel + 49 - (0)179 - 51 10 982
>
> DaWanda GmbH
> Windscheidstraße 18
> 10627 Berlin
>
> Geschäftsführer: Claudia Helming, Michael Pütz
> Amtsgericht Charlottenburg HRB 104695 B
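For reference, restoring the fail-safe Erick recommends is a one-line change in the solrconfig.xml above (a sketch; 2 is the value shipped in the stock example config):

```xml
<!-- Fail-safe: refuse to warm more than 2 searchers concurrently.
     Hitting this limit is a signal to investigate commit frequency
     and autowarm times, not a value to raise. -->
<maxWarmingSearchers>2</maxWarmingSearchers>
```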