Here are the logs that didn't make it through with the image (sorry for the misalignment):
9/23/2015, 7:14:49 PM ERROR StreamingSolrClients error
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
    at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:233)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

9/23/2015, 7:14:49 PM WARN DistributedUpdateProcessor Error sending update to http://solr8.dawanda.services:8983/solr
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    (same ConnectionPoolTimeoutException stack trace as above)

9/23/2015, 7:14:49 PM WARN ZkController Leader is publishing core=dawanda coreNodeName=core_node10 state=down on behalf of un-reachable replica http://solr8.dawanda.services:8983/solr/dawanda/; forcePublishState? false

9/23/2015, 7:14:49 PM ERROR DistributedUpdateProcessor Setting up to try to start recovery on replica http://solr8.dawanda.services:8983/solr/dawanda/ after: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool

9/23/2015, 7:14:49 PM ERROR StreamingSolrClients error

9/23/2015, 7:14:49 PM WARN DistributedUpdateProcessor Error sending update to http://solr8.dawanda.services:8983/solr

I am pretty sure I saw the overlapping searchers, but I'd set that aside for now since it hasn't appeared again; maybe it was an old log entry that confused me. I'm going to check the GC logs and get back with more information.

From the logs, I think the obvious action to take would be to increase zkClientTimeout to 30s (currently 15s).
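For reference, here is a minimal sketch of where that change could live, assuming the discovery-style solr.xml format (the 30000 ms value is the 30s discussed above; the surrounding host/port entries are only illustrative context and may differ from our actual file):

<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <!-- Raise the ZooKeeper session timeout from the current 15s to 30s so a
         long GC pause is less likely to get the replica published as down. -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>

The ${zkClientTimeout:30000} placeholder also lets the value be overridden per node by setting the zkClientTimeout system property on the JVM command line.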
On 23 September 2015 at 18:43, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: and when one of the nodes enter recovery mode the leader says this:
>
> Hmmm, nothing came through, the mail filter is pretty aggressive about
> stripping attachments though.
>
> bq: You mean 10 seconds apart?
>
> Hmmm, no I mean 10 minutes. That would explain the overlapping searchers
> since the only time you should be opening a new searcher is when you do a
> soft commit. But from the log fragment you pasted, that isn't happening so
> I'm totally mystified why you were getting those. Or was that some time ago
> and things have changed since then? I totally don't understand why you'd be
> seeing overlapping searchers, so grasping at straws. Given what you've
> added, I suspect this is totally off base.
>
> I see you're printing GC stats, are any of those GC times exceeding 15
> seconds for stop-the-world pauses? That can trigger ZK marking the node as
> down and the node going into recovery. But with 16G heaps, that is
> unlikely, but you can see in your logs. If you are getting long
> stop-the-world pauses, perhaps you're very close to your JVM limit? In
> _some_ cases I've seen adding more memory decrease GC thrashing, the
> scenario is this:
>
> You're running very close to the limit of your JVM and each collection only
> reclaims a very little bit of memory (say 0.05%), so it goes immediately
> into another GC cycle. This is totally a shot in the dark though.
>
> Anyway, it's mystifying so far, but from everything you've written your
> system should not be going into recovery and you shouldn't be seeing
> overlapping searchers, so I'm missing something.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 9:25 AM, Lorenzo Fundaró
> <lorenzo.fund...@dawandamail.com> wrote:
>
> > On 23 September 2015 at 18:08, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > Wow, this is not expected at all. There's no way you should, on the
> > > face of it, get overlapping on-deck searchers.
> > >
> > > I recommend you put your maxWarmingSearchers back to 2, that's a
> > > fail-safe that is there to make people look at why they're warming a
> > > bunch of searchers at once.
> >
> > Ok, will do.
> >
> > > With those settings, it's saying that autowarming is taking over 10
> > > minutes.
> >
> > What do you mean? I don't think the autowarming is taking 10 minutes
> > here. Do you maybe mean 10 seconds apart? Every 10 min I issue a soft
> > commit which would create a new searcher with several autowarmings, one
> > for each cache. Looking at the stats the slowest autowarm is from filter
> > cache and it takes around 2 seconds.
> >
> > > This is absurdly long, so either something is pathologically wrong
> > > with your Solr or you're really committing more often than you think.
> > > Possibly you have a client issuing commits?
> >
> > mm no, I checked my client and I am not doing any commit.
> > Here are some logs from the leader:
> >
> > INFO - 2015-09-23 16:21:59.803; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > INFO - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
> >   commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilp,generation=24109}
> >   commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}
> > INFO - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy; newest commit generation = 24110
> > INFO - 2015-09-23 16:22:01.010; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> > INFO - 2015-09-23 16:22:16.967; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > INFO - 2015-09-23 16:22:18.452; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2
> >   commit{dir=NRTCachingDirectory(MMapDirectory@/srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}
> >
> > > You can look at your Solr logs and see commits, just look for the word
> > > "commit". When reading those lines, it'll say whether it has
> > > openSearcher true or false. Are the timestamps when openSearcher=true
> > > really 10 minutes apart?
> >
> > You mean 10 seconds apart?
> >
> > > You'll also see autowarm times in your logs, see how long they really
> > > take. If they really take 10 minutes, we need to get to the bottom of
> > > that because the autowarm counts you're showing in your cache
> > > configurations don't indicate any problem here.
> > >
> > > Bottom line:
> > > 1> you shouldn't be seeing nodes go into recovery in the first place.
> > > Are your Solr logs showing any ERROR level messages?
> > >
> > > 2> it's extremely surprising that you're getting any overlapping
> > > on-deck searchers. If it turns out that your autowarming is really
> > > taking more than a few seconds, getting a stack trace to see where
> > > Solr is spending all the time is warranted.
> >
> > Will do.
> >
> > > 3> Any clues from the logs _why_ they're going into recovery? Also
> > > look at your leader's log file and see if there are any messages about
> > > "leader initiated recovery". If you see that, then perhaps one of the
> > > timeouts is too short.
> > >
> > > 4> the tlog size is quite reasonable. It's only relevant when a node
> > > goes down for some reason anyway, so I wouldn't expend too much energy
> > > worrying about them until we get to the bottom of overlapping
> > > searchers and nodes going into recovery.
> > >
> > > BTW, nice job of laying out the relevant issues and adding supporting
> > > information! I wish more problem statements were as complete.
> > > If your Solr is 4.7.0, there was a memory problem and you should
> > > definitely go to 4.7.2. The symptom here is that you'll see Out of
> > > Memory errors...
> > >
> > > Best,
> > > Erick
> >
> > Thank you very much!
> >
> > > On Wed, Sep 23, 2015 at 8:48 AM, Lorenzo Fundaró
> > > <lorenzo.fund...@dawandamail.com> wrote:
> > >
> > > > Hi!,
> > > >
> > > > I keep getting nodes that fall into recovery mode and then issue the
> > > > following WARN log every 10 seconds:
> > > >
> > > > WARN Stopping recovery for core=xxxx coreNodeName=core_node7
> > > >
> > > > and sometimes this appears as well:
> > > >
> > > > PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> > > >
> > > > At higher traffic times this gets worse, and out of 4 nodes only 1
> > > > is up.
> > > >
> > > > I have 4 Solr nodes, each running two cores, A and B, of 13GB and
> > > > 1.5GB respectively. Core A gets a lot of index updates and higher
> > > > query traffic compared to core B. Core A is going through
> > > > active/recovery/down states very often.
> > > >
> > > > Nodes are coordinated via ZooKeeper; we have three, running on
> > > > different machines than Solr.
> > > >
> > > > Each machine has around 24 cores and between 38 and 48 GB of RAM,
> > > > with each Solr getting 16GB of heap memory.
> > > >
> > > > I read this article:
> > > > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > >
> > > > and decided to apply:
> > > >
> > > > <autoCommit>
> > > >   <!-- Every 15 seconds -->
> > > >   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
> > > >   <openSearcher>false</openSearcher>
> > > > </autoCommit>
> > > >
> > > > and
> > > >
> > > > <autoSoftCommit>
> > > >   <!-- Every 10 minutes -->
> > > >   <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
> > > > </autoSoftCommit>
> > > >
> > > > I also have these cache configurations:
> > > >
> > > > <filterCache class="solr.LFUCache"
> > > >              size="64"
> > > >              initialSize="64"
> > > >              autowarmCount="32"/>
> > > >
> > > > <queryResultCache class="solr.LRUCache"
> > > >                   size="512"
> > > >                   initialSize="512"
> > > >                   autowarmCount="0"/>
> > > >
> > > > <documentCache class="solr.LRUCache"
> > > >                size="1024"
> > > >                initialSize="1024"
> > > >                autowarmCount="0"/>
> > > >
> > > > <cache name="perSegFilter"
> > > >        class="solr.search.LRUCache"
> > > >        size="10"
> > > >        initialSize="0"
> > > >        autowarmCount="10"
> > > >        regenerator="solr.NoOpRegenerator" />
> > > >
> > > > <fieldValueCache class="solr.FastLRUCache"
> > > >                  size="512"
> > > >                  autowarmCount="0"
> > > >                  showItems="32" />
> > > >
> > > > I also have this:
> > > >
> > > > <maxWarmingSearchers>6</maxWarmingSearchers>
> > > >
> > > > The size of the tlogs is usually between 1MB and 8MB.
> > > >
> > > > I thought the changes above could improve the situation, but I am not
> > > > 100% convinced they did, since after 15 min one of the nodes entered
> > > > recovery mode again.
> > > >
> > > > Any ideas?
> > > >
> > > > Thanks in advance.
> > > >
> > > > Cheers!
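Regarding Erick's recommendation above to put maxWarmingSearchers back to 2, this is the solrconfig.xml change I intend to apply; a minimal sketch (the comment is mine):

<!-- Back to the fail-safe value of 2 so Solr complains loudly if more than two
     searchers ever warm at once, instead of quietly absorbing the extra work. -->
<maxWarmingSearchers>2</maxWarmingSearchers>

With the 10-minute autoSoftCommit above, openSearcher=false on the hard commit, and autowarming that only takes a couple of seconds, only one searcher should ever be warming at a time, so hitting this limit again would point to searchers being opened from somewhere else.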
--
Lorenzo Fundaro
Backend Engineer
E-Mail: lorenzo.fund...@dawandamail.com

Fax + 49 - (0)30 - 25 76 08 52
Tel + 49 - (0)179 - 51 10 982

DaWanda GmbH
Windscheidstraße 18
10627 Berlin

Geschäftsführer: Claudia Helming, Michael Pütz
Amtsgericht Charlottenburg HRB 104695 B