Here are the logs that didn't make it through with the image (sorry for the misalignment):
9/23/2015, 7:14:49 PM ERROR StreamingSolrClients error
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
    at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:233)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

9/23/2015, 7:14:49 PM WARN DistributedUpdateProcessor Error sending update to http://solr8.dawanda.services:8983/solr
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    (same ConnectionPoolTimeoutException stack trace as above)

9/23/2015, 7:14:49 PM WARN ZkController Leader is publishing core=dawanda coreNodeName=core_node10 state=down on behalf of un-reachable replica http://solr8.dawanda.services:8983/solr/dawanda/; forcePublishState? false

9/23/2015, 7:14:49 PM ERROR DistributedUpdateProcessor Setting up to try to start recovery on replica http://solr8.dawanda.services:8983/solr/dawanda/ after: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool

9/23/2015, 7:14:49 PM ERROR StreamingSolrClients error

9/23/2015, 7:14:49 PM WARN DistributedUpdateProcessor Error sending update to http://solr8.dawanda.services:8983/solr

I am pretty sure I saw the overlapping searchers, but I'd set that aside for now since it hasn't appeared again; maybe it was an old log entry that confused me. I'm going to check the GC logs and get back with more information.

From the logs, I think the obvious action to take would be to increase zkClientTimeout to 30s (currently 15s).
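For reference, here is a minimal sketch of where that change could live, assuming the discovery-style solr.xml format (the 30000 ms value is the 30s discussed above; the surrounding host/port entries are only illustrative context and may differ from our actual file):

<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <!-- Raise the ZooKeeper session timeout from the current 15s to 30s so a
         long GC pause is less likely to get the replica published as down. -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>

The ${zkClientTimeout:30000} placeholder also lets the value be overridden per node by setting the zkClientTimeout system property on the JVM command line.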
On 23 September 2015 at 18:43, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: and when one of the nodes enter recovery mode the leader says this:
>
> Hmmm, nothing came through, the mail filter is pretty aggressive about
> stripping attachments though.
>
> bq: You mean 10 seconds apart?
>
> Hmmm, no I mean 10 minutes. That would explain the overlapping searchers
> since the only time you should be opening a new searcher is when you do a
> soft commit. But from the log fragment you pasted, that isn't happening so
> I'm totally mystified why you were getting those. Or was that some time ago
> and things have changed since then? I totally don't understand why you'd be
> seeing overlapping searchers, so grasping at straws. Given what you've
> added, I suspect this is totally off base.
>
> I see you're printing GC stats, are any of those GC times exceeding 15
> seconds for stop-the-world pauses? That can trigger ZK marking the node as
> down and the node going into recovery. But with 16G heaps, that is
> unlikely, but you can see in your logs. If you are getting long
> stop-the-world pauses, perhaps you're very close to your JVM limit? In
> _some_ cases I've seen adding more memory decrease GC thrashing, the
> scenario is this:
>
> You're running very close to the limit of your JVM and each collection only
> reclaims a very little bit of memory (say 0.05%), so it goes immediately
> into another GC cycle. This is totally a shot in the dark though.
>
> Anyway, it's mystifying so far, but from everything you've written your
> system should not be going into recovery and you shouldn't be seeing
> overlapping searchers, so I'm missing something.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 9:25 AM, Lorenzo Fundaró
> <lorenzo.fund...@dawandamail.com> wrote:
>
> > On 23 September 2015 at 18:08, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > Wow, this is not expected at all. There's no way you should, on the
> > > face of it, get overlapping on-deck searchers.
> > >
> > > I recommend you put your maxWarmingSearchers back to 2, that's a
> > > fail-safe that is there to make people look at why they're warming a
> > > bunch of searchers at once.
> >
> > Ok, will do.
> >
> > > With those settings, it's saying that autowarming is taking over 10
> > > minutes.
> >
> > What do you mean? I don't think the autowarming is taking 10 minutes
> > here. Do you maybe mean 10 seconds apart? Every 10 min I issue a soft
> > commit which would create a new searcher with several autowarmings, one
> > for each cache. Looking at the stats the slowest autowarm is from filter
> > cache and it takes around 2 seconds.
> >
> > > This is absurdly long, so either something is pathologically wrong
> > > with your Solr or you're really committing more often than you think.
> > > Possibly you have a client issuing commits?
> >
> > mm no, I checked my client and I am not doing any commit.
> > Here are some logs from the leader:
> >
> > INFO - 2015-09-23 16:21:59.803; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > INFO - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
> >   commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilp,generation=24109}
> >   commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}
> > INFO - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy; newest commit generation = 24110
> > INFO - 2015-09-23 16:22:01.010; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> > INFO - 2015-09-23 16:22:16.967; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > INFO - 2015-09-23 16:22:18.452; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
> >   commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}
> >
> > > You can look at your Solr logs and see commits, just look for the word
> > > "commit". When reading those lines, it'll say whether it has
> > > openSearcher true or false. Are the timestamps when openSearcher=true
> > > really 10 minutes apart?
> >
> > You mean 10 seconds apart?
> >
> > > You'll also see autowarm times in your logs, see how long they really
> > > take. If they really take 10 minutes, we need to get to the bottom of
> > > that because the autowarm counts you're showing in your cache
> > > configurations don't indicate any problem here.
> > >
> > > Bottom line:
> > > 1> you shouldn't be seeing nodes go into recovery in the first place.
> > > Are your Solr logs showing any ERROR level messages?
> > >
> > > 2> it's extremely surprising that you're getting any overlapping
> > > on-deck searchers. If it turns out that your autowarming is really
> > > taking more than a few seconds, getting a stack trace to see where
> > > Solr is spending all the time is warranted.
> >
> > Will do.
> >
> > > 3> Any clues from the logs _why_ they're going into recovery? Also
> > > look at your leader's log file and see if there are any messages about
> > > "leader initiated recovery". If you see that, then perhaps one of the
> > > timeouts is too short.
> > >
> > > 4> the tlog size is quite reasonable. It's only relevant when a node
> > > goes down for some reason anyway, so I wouldn't expend too much energy
> > > worrying about them until we get to the bottom of overlapping
> > > searchers and nodes going into recovery.
> > >
> > > BTW, nice job of laying out the relevant issues and adding supporting
> > > information! I wish more problem statements were as complete.
> > > If your Solr is 4.7.0, there was a memory problem and you should
> > > definitely go to 4.7.2. The symptom here is that you'll see Out of
> > > Memory errors...
> > >
> > > Best,
> > > Erick
> >
> > Thank you very much!
> >
> > > On Wed, Sep 23, 2015 at 8:48 AM, Lorenzo Fundaró
> > > <lorenzo.fund...@dawandamail.com> wrote:
> > >
> > > > Hi!,
> > > >
> > > > I keep getting nodes that fall into recovery mode and then issue the
> > > > following WARN log every 10 seconds:
> > > >
> > > > WARN Stopping recovery for core=xxxx coreNodeName=core_node7
> > > >
> > > > and sometimes this appears as well:
> > > >
> > > > PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> > > >
> > > > At higher traffic times this gets worse, and out of 4 nodes only 1
> > > > is up.
> > > >
> > > > I have 4 Solr nodes, each running two cores, A and B, of 13GB and
> > > > 1.5GB respectively. Core A gets a lot of index updates and higher
> > > > query traffic compared to core B. Core A is going through
> > > > active/recovery/down states very often.
> > > >
> > > > Nodes are coordinated via ZooKeeper; we have three, running on
> > > > different machines than Solr.
> > > >
> > > > Each machine has around 24 cores and between 38 and 48 GB of RAM,
> > > > with each Solr getting 16GB of heap memory.
> > > >
> > > > I read this article:
> > > > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > >
> > > > and decided to apply:
> > > >
> > > > <autoCommit>
> > > >   <!-- Every 15 seconds -->
> > > >   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
> > > >   <openSearcher>false</openSearcher>
> > > > </autoCommit>
> > > >
> > > > and
> > > >
> > > > <autoSoftCommit>
> > > >   <!-- Every 10 minutes -->
> > > >   <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
> > > > </autoSoftCommit>
> > > >
> > > > I also have these cache configurations:
> > > >
> > > > <filterCache class="solr.LFUCache"
> > > >              size="64"
> > > >              initialSize="64"
> > > >              autowarmCount="32"/>
> > > >
> > > > <queryResultCache class="solr.LRUCache"
> > > >                   size="512"
> > > >                   initialSize="512"
> > > >                   autowarmCount="0"/>
> > > >
> > > > <documentCache class="solr.LRUCache"
> > > >                size="1024"
> > > >                initialSize="1024"
> > > >                autowarmCount="0"/>
> > > >
> > > > <cache name="perSegFilter"
> > > >        class="solr.search.LRUCache"
> > > >        size="10"
> > > >        initialSize="0"
> > > >        autowarmCount="10"
> > > >        regenerator="solr.NoOpRegenerator" />
> > > >
> > > > <fieldValueCache class="solr.FastLRUCache"
> > > >                  size="512"
> > > >                  autowarmCount="0"
> > > >                  showItems="32" />
> > > >
> > > > I also have this:
> > > >
> > > > <maxWarmingSearchers>6</maxWarmingSearchers>
> > > >
> > > > The size of the tlogs is usually between 1MB and 8MB.
> > > >
> > > > I thought the changes above could improve the situation, but I am not
> > > > 100% convinced they did, since after 15 min one of the nodes entered
> > > > recovery mode again.
> > > >
> > > > Any ideas?
> > > >
> > > > Thanks in advance.
> > > >
> > > > Cheers!
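Regarding Erick's recommendation above to put maxWarmingSearchers back to 2, this is the solrconfig.xml change I intend to apply; a minimal sketch (the comment is mine):

<!-- Back to the fail-safe value of 2 so Solr complains loudly if more than two
     searchers ever warm at once, instead of quietly absorbing the extra work. -->
<maxWarmingSearchers>2</maxWarmingSearchers>

With the 10-minute autoSoftCommit above, openSearcher=false on the hard commit, and autowarming that only takes a couple of seconds, only one searcher should ever be warming at a time, so hitting this limit again would point to searchers being opened from somewhere else.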
--
Lorenzo Fundaro
Backend Engineer
E-Mail: lorenzo.fund...@dawandamail.com

Fax + 49 - (0)30 - 25 76 08 52
Tel + 49 - (0)179 - 51 10 982

DaWanda GmbH
Windscheidstraße 18
10627 Berlin

Geschäftsführer: Claudia Helming, Michael Pütz
Amtsgericht Charlottenburg HRB 104695 B