Wow, this is not expected at all. On the face of it, there's no way you should be getting overlapping on-deck searchers.
I recommend you put your maxWarmingSearchers back to 2. That setting is a fail-safe that is there to make people look at why they're warming a bunch of searchers at once. With your settings, it's saying that autowarming is taking over 10 minutes. That is absurdly long, so either something is pathologically wrong with your Solr or you're really committing more often than you think. Possibly you have a client issuing commits?

You can see commits in your Solr logs; just look for the word "commit". Each of those lines will say whether the commit has openSearcher true or false. Are the timestamps where openSearcher=true really 10 minutes apart? You'll also see autowarm times in your logs; check how long they really take. If they really take 10 minutes, we need to get to the bottom of that, because the autowarm counts in the cache configurations you're showing don't indicate any problem here.

Bottom line:

1> You shouldn't be seeing nodes go into recovery in the first place. Are your Solr logs showing any ERROR-level messages?

2> It's extremely surprising that you're getting any overlapping on-deck searchers. If it turns out that your autowarming really is taking more than a few seconds, getting a stack trace to see where Solr is spending all the time is warranted.

3> Any clues from the logs as to _why_ the nodes are going into recovery? Also look at your leader's log file and see if there are any messages about "leader initiated recovery". If you see that, then perhaps one of the timeouts is too short.

4> The tlog sizes are quite reasonable. Tlogs are only relevant when a node goes down for some reason anyway, so I wouldn't expend too much energy worrying about them until we get to the bottom of the overlapping searchers and the nodes going into recovery.

BTW, nice job of laying out the relevant issues and adding supporting information! I wish more problem statements were as complete.

If your Solr is 4.7.0, there was a memory problem and you should definitely go to 4.7.2.
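To check the commit cadence, a quick script along these lines can pull the openSearcher=true commits out of the log and print the gaps between them. The timestamp format and the exact wording of the commit lines here are assumptions; adjust the regex to match what your Solr version actually writes to solr.log:

```python
import re
from datetime import datetime

# Assumed log-line shape -- verify against your own solr.log before
# trusting the regex; different Solr versions/log configs vary.
COMMIT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*commit.*openSearcher=(true|false)"
)

def opensearcher_gaps(lines):
    """Return gaps, in seconds, between successive openSearcher=true commits."""
    times = []
    for line in lines:
        m = COMMIT_RE.match(line)
        if m and m.group(2) == "true":
            times.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# Fabricated sample: searcher-opening commits 10 minutes apart,
# which is what the autoSoftCommit setting below should produce.
sample = [
    "2015-09-23 08:00:00 INFO  start commit{...openSearcher=false...}",
    "2015-09-23 08:10:00 INFO  start commit{...openSearcher=true...}",
    "2015-09-23 08:20:00 INFO  start commit{...openSearcher=true...}",
]
print(opensearcher_gaps(sample))
```

If the printed gaps are much shorter than your autoSoftCommit interval, something else (most likely a client) is issuing commits.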
The symptom here is that you'll see Out of Memory errors...

Best,
Erick

On Wed, Sep 23, 2015 at 8:48 AM, Lorenzo Fundaró <lorenzo.fund...@dawandamail.com> wrote:
> Hi!
>
> I keep getting nodes that fall into recovery mode and then issue the
> following WARN log every 10 seconds:
>
> WARN Stopping recovery for core=xxxx coreNodeName=core_node7
>
> and sometimes this appears as well:
>
> PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>
> At higher-traffic times this gets worse, and out of 4 nodes only 1 is up.
>
> I have 4 Solr nodes, each running two cores, A and B, of 13GB and 1.5GB
> respectively. Core A gets a lot of index updates and higher query traffic
> compared to core B. Core A is going through active/recovery/down states
> very often.
>
> The nodes are coordinated via ZooKeeper; we have three, running on
> different machines than Solr.
>
> Each machine has around 24 cores and between 38 and 48 GB of RAM, with
> each Solr getting 16GB of heap memory.
>
> I read this article:
>
> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> and decided to apply:
>
> <autoCommit>
>   <!-- Every 15 seconds -->
>   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> and
>
> <autoSoftCommit>
>   <!-- Every 10 minutes -->
>   <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
> </autoSoftCommit>
>
> I also have these cache configurations:
>
> <filterCache class="solr.LFUCache"
>              size="64"
>              initialSize="64"
>              autowarmCount="32"/>
>
> <queryResultCache class="solr.LRUCache"
>                   size="512"
>                   initialSize="512"
>                   autowarmCount="0"/>
>
> <documentCache class="solr.LRUCache"
>                size="1024"
>                initialSize="1024"
>                autowarmCount="0"/>
>
> <cache name="perSegFilter"
>        class="solr.search.LRUCache"
>        size="10"
>        initialSize="0"
>        autowarmCount="10"
>        regenerator="solr.NoOpRegenerator" />
>
> <fieldValueCache class="solr.FastLRUCache"
>                  size="512"
>                  autowarmCount="0"
>                  showItems="32" />
>
> I also have this:
>
> <maxWarmingSearchers>6</maxWarmingSearchers>
>
> The size of the tlogs is usually between 1MB and 8MB.
>
> I thought the changes above could improve the situation, but I am not
> 100% convinced they did, since after 15 min one of the nodes entered
> recovery mode again.
>
> Any ideas?
>
> Thanks in advance.
>
> Cheers!
>
> --
> Lorenzo Fundaro
> Backend Engineer
> E-Mail: lorenzo.fund...@dawandamail.com
>
> Fax + 49 - (0)30 - 25 76 08 52
> Tel + 49 - (0)179 - 51 10 982
>
> DaWanda GmbH
> Windscheidstraße 18
> 10627 Berlin
>
> Geschäftsführer: Claudia Helming, Michael Pütz
> Amtsgericht Charlottenburg HRB 104695 B
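For reference, restoring the fail-safe Erick recommends is a one-line change in the solrconfig.xml above (a sketch; 2 is the value shipped in the stock example config):

```xml
<!-- Fail-safe: refuse to warm more than 2 searchers concurrently.
     Hitting this limit is a signal to investigate commit frequency
     and autowarm times, not a value to raise. -->
<maxWarmingSearchers>2</maxWarmingSearchers>
```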