Re: Solr nodes going into recovery mode and eventually failing

2017-10-24 Thread Emir Arnautović
Hi Shamik,
Please see inline comments/questions.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 24 Oct 2017, at 07:41, shamik  wrote:
> 
> Thanks Emir and Zisis.
> 
> I added the maxRamMB for filterCache and reduced the size. I could see the
> benefit immediately; the hit ratio went up to 0.97. Here's the configuration:
> 
>  autowarmCount="128" maxRamMB="500" />
>  autowarmCount="128" />
>  autowarmCount="0" />
[EA] Based on what you mentioned earlier, not all your filters are "cache 
friendly" and the hit rate depends on your clients' usage (or maybe on how many 
concurrent clients you have) - in other words, this hit ratio could be a false 
positive. One explanation is that you previously had a count limit and now a 
memory limit, and that the memory limit ignores the count (would need to check 
this), and that your cache items are on average smaller than estimated, so more 
than 4K of them fit into 500MB. But based on your query rate and commit 
interval, I am not sure you needed more than 4K in the first place.

> 
> It seemed to be stable for a few days; the cache hits and JVM pool utilization
> seemed to be well within the expected range. But the OOM issue occurred on one
> of the nodes as the heap size reached 30gb. The hit ratios for the query result
> cache and document cache at that point were recorded as 0.18 and 0.65. I'm
> not sure if the caches caused the memory spike at this point; with the filter
> cache restricted to 500mb, its footprint should be negligible. One thing I noticed
> is that the eviction rate now (with the addition of maxRamMB) is staying at 0.
[EA] Did you see evictions before? With 400rq/h and 10min commit intervals, you 
get 60-70 rq between two commits. With 4K cache size, each request should 
consume >60 cache entries in order to start evicting.

> Index hard commit happens every 10 min, which is when the cache gets
> flushed. Based on the monitoring log, the spike happened on the indexing
> side, where almost 8k docs went into a pending state.
> 
> From a query performance standpoint, there have been occasional slow queries
> (1sec+), but nothing alarming so far. Same goes for deep paging; I haven't
> seen any evidence pointing to that.
> 
> Based on the hit ratios, I can further scale down the query result and
> document caches, and also change them to FastLRUCache and add maxRamMB. For the
> filter cache, I think this setting should be good enough to work within a 30gb
> heap, unless I'm wrong about how maxRamMB works. I'll have to get a heap dump
> somehow; unfortunately, the whole process (of the node going down) happens
> so quickly that I hardly have any time to run a profiler.
[EA] You did not mention whether you ruled out fieldCache and fieldValueCache.
I don't have much experience with LTR, but I see that there is another cache 
related to it. Do you use it? Could that be the component that consumes the memory?

> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr nodes going into recovery mode and eventually failing

2017-10-23 Thread shamik
Thanks Emir and Zisis.

I added the maxRamMB for filterCache and reduced the size. I could see the
benefit immediately; the hit ratio went up to 0.97. Here's the configuration:





It seemed to be stable for a few days; the cache hits and JVM pool utilization
seemed to be well within the expected range. But the OOM issue occurred on one
of the nodes as the heap size reached 30gb. The hit ratios for the query result
cache and document cache at that point were recorded as 0.18 and 0.65. I'm
not sure if the caches caused the memory spike at this point; with the filter
cache restricted to 500mb, its footprint should be negligible. One thing I noticed
is that the eviction rate now (with the addition of maxRamMB) is staying at 0.
Index hard commit happens every 10 min, which is when the cache gets
flushed. Based on the monitoring log, the spike happened on the indexing
side, where almost 8k docs went into a pending state.

From a query performance standpoint, there have been occasional slow queries
(1sec+), but nothing alarming so far. Same goes for deep paging; I haven't
seen any evidence pointing to that.

Based on the hit ratios, I can further scale down the query result and
document caches, and also change them to FastLRUCache and add maxRamMB. For the
filter cache, I think this setting should be good enough to work within a 30gb
heap, unless I'm wrong about how maxRamMB works. I'll have to get a heap dump
somehow; unfortunately, the whole process (of the node going down) happens
so quickly that I hardly have any time to run a profiler.
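
For example, the scaled-down entries might look something like the following
sketch (the class, sizes and maxRamMB values here are placeholders I'd still
need to tune, not final numbers):

<queryResultCache class="solr.FastLRUCache" size="512" autowarmCount="0" maxRamMB="128"/>
<documentCache class="solr.FastLRUCache" size="512" autowarmCount="0" maxRamMB="256"/>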



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes going into recovery mode and eventually failing

2017-10-23 Thread Emir Arnautović
You mentioned that you are on v6.6, but in case someone else reads this, just 
to add that maxRamMB was added to FastLRUCache in version 6.4.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 23 Oct 2017, at 14:35, Zisis T.  wrote:
> 
> shamik wrote
>> I was not aware of the maxRamMB parameter; it looks like it's only available for
>> queryResultCache. Is that what you are referring to? Can you please share
>> your cache configuration?
> 
> I've set up the filterCache entry inside solrconfig.xml as follows:
> 
>  autowarmCount="0" maxRamMB="120"/>
> 
> I had a look inside the FastLRUCache code and saw that maxRamMB takes precedence
> over size. I can also confirm that I had more than 512 entries inside
> the cache, so the above will work.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr nodes going into recovery mode and eventually failing

2017-10-23 Thread Zisis T.
shamik wrote
> I was not aware of the maxRamMB parameter; it looks like it's only available for
> queryResultCache. Is that what you are referring to? Can you please share
> your cache configuration?

I've set up the filterCache entry inside solrconfig.xml as follows:

 autowarmCount="0" maxRamMB="120"/>

I had a look inside the FastLRUCache code and saw that maxRamMB takes precedence
over size. I can also confirm that I had more than 512 entries inside
the cache, so the above will work.
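
For completeness, the full cache definition is of this form (the class and size
shown here are indicative - as noted above, maxRamMB takes precedence over size
anyway):

<filterCache class="solr.FastLRUCache" size="512" autowarmCount="0" maxRamMB="120"/>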



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes going into recovery mode and eventually failing

2017-10-23 Thread Emir Arnautović
Hi Shamik,
I agree that your filter cache is not the reason for the OOMs. Can you confirm that 
your fieldCache and fieldValueCache sizes are not consuming too much memory? 
The next thing on the list would be some heavy faceting with pivots, but you 
mentioned that all fields are low cardinality. Do you see any extremely slow 
queries in your logs? Can you check if there are some deep paging queries?
If nothing else, you can always do a heap dump and see what's in it.
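
If the node dies too quickly to attach a profiler, one option is to let the JVM
write the dump for you on OOM, e.g. something along these lines in solr.in.sh
(the dump path is just an example):

SOLR_OPTS="$SOLR_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/solr/dumps"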

And about your filterCache hit ratio: how frequently do you commit? With 400 
rq/h it could be that filters are not repeated between two commits. Do you have 
a high eviction rate?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 20 Oct 2017, at 20:10, shamik  wrote:
> 
> Zisis, thanks for chiming in. This is really interesting information and
> probably in line with what I'm trying to fix. In my case, the facet fields are
> certainly not high-cardinality ones. Most of them have a finite set of values,
> the max being 200 (though that field has a low usage percentage). Earlier I had
> facet.limit=-1, but then scaled it down to 200 to eliminate any performance
> overhead.
> 
> I was not aware of the maxRamMB parameter; it looks like it's only available for
> queryResultCache. Is that what you are referring to? Can you please share
> your cache configuration?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr nodes going into recovery mode and eventually failing

2017-10-20 Thread shamik
Zisis, thanks for chiming in. This is really interesting information and
probably in line with what I'm trying to fix. In my case, the facet fields are
certainly not high-cardinality ones. Most of them have a finite set of values,
the max being 200 (though that field has a low usage percentage). Earlier I had
facet.limit=-1, but then scaled it down to 200 to eliminate any performance
overhead.

I was not aware of the maxRamMB parameter; it looks like it's only available for
queryResultCache. Is that what you are referring to? Can you please share
your cache configuration?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes going into recovery mode and eventually failing

2017-10-20 Thread shamik
Thanks Erick. In my case, each replica is running on its own JVM, so even if
we consider 8gb of filter cache, it still has 27gb to play with. Isn't that
a decent amount of memory to handle the rest of the JVM operations?

Here's an example of the implicit filters that get applied to almost all the
queries. Except for Source2 and AccessMode, the rest of the fields have doc
values enabled. Our sorting is done mostly on relevance, so there's little
impact there.

fq=language:("english")&fq=ContentGroup:("Learn & Explore" OR "Getting
Started" OR "Troubleshooting" OR "Downloads")&fq=Source2:("Help" OR
"documentation" OR "video" OR (+Source2:"discussion" +Solution:"yes") OR
"sfdcarticles" OR "downloads" OR "topicarticles" OR "screen" OR "blog" OR
"simplecontent" OR "auonline" OR "contributedlink" OR "collection") AND
-workflowparentid:[*+TO+*] AND -AccessMode:"internal" AND -AccessMode:"beta"
AND -DisplayName:Partner AND -GlobalDedup:true AND -Exclude:"knowledge" AND
-Exclude:"all"&bf=recip(ms(NOW/DAY,PublishDate),3.16e-11,1,1)^1.0

As you can see, there are a bunch, so the filter cache is pretty important for us
from a performance standpoint. The hit ratio of 25% is abysmal, and I don't
think there are too many unique queries contributing to it. As I
mentioned earlier, increasing the size parameter does improve the hit
count. Just wondering, what are the best practices around scenarios like
this? Looks like I've pretty much exhausted my options :-).



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes going into recovery mode and eventually failing

2017-10-20 Thread Zisis T.
I'll post my experience too; I believe it might be related to the low
filterCache hit ratio issue. Please let me know if you think I'm off topic
here and should create a separate thread.

I've run search stress tests on 2 different Solr 6.5.1 installations,
sending distributed search queries with facets (using facet.field; the faceted
fields have docValues="true").

*1-shard*
1 shard of ~200GB/8M docs
FilterCache with default 512 entries. 90% hit ratio

*2-shards*
2 shards of ~100GB/4M docs each
FilterCache with default 512 entries. 10% hit ratio. Huge drop in search
performance

I noticed that the majority of the filterCache entries look like "filters on
facet terms", and instead of a FixedBitSet whose size is equal to the # of
docs in the shard, each entry contains an int[] of the matching docids:

Key                   Value
-------------------   ---------------------------
FacetField1:Term1   ->  int[] of matching docids
FacetField1:Term2   ->  int[] of matching docids
FacetField2:Term3   ->  int[] of matching docids
FacetField2:Term4   ->  int[] of matching docids

Given that Field1 and Field2 are high-cardinality fields, there are too many
keys in the cache, but few matched documents in most cases.
Therefore, since the cache values do not need much memory, I ended up
using maxRamMB=120, which in my case gives an ~80% hit ratio, allowing more
entries in the cache and better control over consumed memory.

This has been previously discussed here too
http://lucene.472066.n3.nabble.com/Filter-cache-pollution-during-sharded-edismax-queries-td4074867.html#a4162151

Is this "overuse" of FilterCache normal in distributed search? 



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes going into recovery mode and eventually failing

2017-10-19 Thread Erick Erickson
Once you hit an OOM, the behavior of Java is indeterminate. There's no
expectation that things will just pick up where they left off when
memory is freed up. Lots of production systems have OOM killer
scripts that automatically kill/restart Java apps that OOM, for just
that reason.

Yes, each replica has its own cache, but the JVM heap is used by them
all. That's why "times the number of replicas". Perhaps a more complete
statement would be "times the number of replicas hosted in the JVM".

Hmmm, 11M docs. Let's call it 16M; at one bit per doc that's 2M bytes per
filterCache entry. Times 4096 gives around 8G that could be used up by a cache
of that size.

Yeah, your hit ratio is poor at 15%. It's relatively unusual to
require that many entries, though. What do the fq clauses look like? Or
are you using something else that consumes the cache (some facet methods
do, for instance)?

And do be sure to use docValues for any field you facet, sort or group on.
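
For instance, a faceted string field would be declared in the schema along
these lines (field name and type are just an example, not taken from your schema):

<field name="ContentGroup" type="string" indexed="true" stored="true" docValues="true"/>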

Best,
Erick

On Thu, Oct 19, 2017 at 2:24 PM, shamik  wrote:
> Thanks Emir. The index is split equally between the two shards, each having
> approx 35gb. The total number of documents is around 11 million, which should
> be distributed equally between the two shards. So, each core should take 3gb
> of the heap for a full cache. Not sure I get the "multiply it by number of
> replicas". Shouldn't each replica have its own cache of 3gb? Moreover, based
> on the SPM graph, the max filter cache size during the outages has been 1.5
> million at most.
>
> The majority of our queries are heavily dependent on some implicit filters and
> user-selected ones. Reducing the filter cache size to the current value of
> 4096 has taken a hit in performance. Earlier (in 5.5), I had a max cache
> size of 10,000 (running on a 15gb allocated heap), which produced a 95% hit
> rate. With the memory issues in 6.6, I started reducing it to the current
> value. It reduced the hit % to 25. I earlier tried reducing the value to
>  autowarmCount="0"/>.
> It still didn't help, which is when I decided to go for a higher-RAM machine.
> What I've noticed is that the heap is consistently around the 22-23gb mark, out
> of which G1 old gen takes close to 13gb and G1 eden space around 6gb, with the
> rest shared by G1 survivor space, Metaspace and the code cache.
>
> This issue has been bothering me, as I seem to be running out of possible
> tuning options. What I could see from the monitoring tool is that the surge
> period saw around 400 requests/hr with 40 docs/sec getting indexed. Is that
> really a high volume of load to handle for a cluster of 6 nodes with 16 CPUs
> / 64gb RAM? What are the other options I should be looking into?
>
> The other thing I'm still confused about is why the recovery fails when the
> memory has been freed up.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes going into recovery mode and eventually failing

2017-10-19 Thread shamik
Thanks Emir. The index is split equally between the two shards, each having
approx 35gb. The total number of documents is around 11 million, which should
be distributed equally between the two shards. So, each core should take 3gb
of the heap for a full cache. Not sure I get the "multiply it by number of
replicas". Shouldn't each replica have its own cache of 3gb? Moreover, based
on the SPM graph, the max filter cache size during the outages has been 1.5
million at most.

The majority of our queries are heavily dependent on some implicit filters and
user-selected ones. Reducing the filter cache size to the current value of
4096 has taken a hit in performance. Earlier (in 5.5), I had a max cache
size of 10,000 (running on a 15gb allocated heap), which produced a 95% hit
rate. With the memory issues in 6.6, I started reducing it to the current
value. It reduced the hit % to 25. I earlier tried reducing the value to
 autowarmCount="0"/>.
It still didn't help, which is when I decided to go for a higher-RAM machine.
What I've noticed is that the heap is consistently around the 22-23gb mark, out
of which G1 old gen takes close to 13gb and G1 eden space around 6gb, with the
rest shared by G1 survivor space, Metaspace and the code cache.

This issue has been bothering me, as I seem to be running out of possible
tuning options. What I could see from the monitoring tool is that the surge
period saw around 400 requests/hr with 40 docs/sec getting indexed. Is that
really a high volume of load to handle for a cluster of 6 nodes with 16 CPUs
/ 64gb RAM? What are the other options I should be looking into?

The other thing I'm still confused about is why the recovery fails when the
memory has been freed up.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr nodes going into recovery mode and eventually failing

2017-10-19 Thread Emir Arnautović
Hi Shamik,
I am pleased to see you find SPM useful!
I think that your problems might be related to the caches exhausting your memory. 
You mentioned that your index is 70GB, but how many documents does it have? Remember 
that each filter cache entry can take up to 1 bit per document. With a filter cache 
size of 4096, a full cache will take up to 0.5GB per 1 million documents (this is per 
core, so multiply it by the number of replicas). If you have some fields without 
doc values, check the sizes of fieldCache and fieldValueCache as well.
The other thing that you should revisit is heap size: heaps over 32GB prevent the 
JVM from using compressed OOPs, so a heap just over 32GB actually fits fewer 
objects than a 31GB heap (the exact boundary depends on the JVM, so it is not 
exactly 32GB).
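
For example, instead of the current 35GB heap, something like the following would
keep compressed OOPs enabled (the exact value is up to you, as long as it stays
below the threshold):

SOLR_JAVA_MEM="-Xms31g -Xmx31g"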

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Oct 2017, at 01:12, Shamik Bandopadhyay  wrote:
> 
> Hi,
> 
>  I'm having this weird issue where Solr nodes suddenly go into recovery
> mode and eventually fail. That one failure kicks off a cascading effect
> and eventually impacts the other nodes. Without a restart, the entire
> cluster goes into limbo after a while. Looking at the logs and the SPM
> monitoring tool, the issue happens under the following circumstances:
> 1. A node gets a spike in query/index requests, thus exhausting its allocated
> memory.
> 2. GC forces the CPU to use 100% of its capacity
> 3. None of the above, when both JVM and CPU are within limits
> 
> I'm using Solr 6.6. Here are the details about the node:
> 
> Hardware type: AWS m4.4xlarge instance
> Total memory : 64 gb
> CPU : 16
> SSD
> SOLR_JAVA_MEM="-Xms35g -Xmx35g"
> GC_TUNE="-XX:+UseG1GC \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=200 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts"
> SOLR_OPTS="$SOLR_OPTS -Xss256k"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=60"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=12"
> 
> Cache Parameters:
> 4096
> 1000
>  autowarmCount="20"/>
>  autowarmCount="10"/>
>  autowarmCount="10"/>
> 
>  initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
>  showItems="10" />
>  class="solr.search.LRUCache" size="4096" initialSize="2048"
> autowarmCount="4096" regenerator="solr.search.NoOpRegenerator" />
> true
> 60
> 
> I currently have 2 shards, each with 2 replicas. The index size is
> approximately 70gb.
> 
> Here's a Solr log trace from the series of events once the node starts
> getting into trouble. I've posted only the relevant entries here.
> 
> 
> org.apache.solr.common.SolrException.log(SolrException.java:148) -
> org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates
> are disabled.
>at
> org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1738)
> 
> org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$5(CoreAdminOperation.java:143)
> - It has been requested that we recover: core=knowledge
> INFO647718[qtp2039328061-1526] -
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:732)
> - [admin] webapp=null path=/admin/cores
> params={core=knowledge&action=REQUESTRECOVERY&wt=javabin&version=2}
> status=0 QTime=0
> INFO647808[qtp2039328061-1540] -
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:187)
> - [knowledge]  webapp=/solr path=/update
> params={update.distrib=FROMLEADER&distrib.from=
> http://xx.xxx.xxx.63:8983/solr/knowledge/&wt=javabin&version=2}{} 0 0
> 
> WARN657500[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659)
> - Socket timeout on send prep recovery cmd, retrying..
> INFO657500[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676)
> - Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr];
> [WaitForState:
> action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
> WARN667514[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659)
> - Socket timeout on send prep recovery cmd, retrying..
> 
> The retry happens a few times, then
> 
> INFO689389[qtp2039328061-1649] -
> org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147)
> - request has come without principal. failed permission {
>  "name":"select",
>  "collection"

Solr nodes going into recovery mode and eventually failing

2017-10-18 Thread Shamik Bandopadhyay
Hi,

  I'm having this weird issue where Solr nodes suddenly go into recovery
mode and eventually fail. That one failure kicks off a cascading effect
and eventually impacts the other nodes. Without a restart, the entire
cluster goes into limbo after a while. Looking at the logs and the SPM
monitoring tool, the issue happens under the following circumstances:
1. A node gets a spike in query/index requests, thus exhausting its allocated
memory.
2. GC forces the CPU to use 100% of its capacity
3. None of the above, when both JVM and CPU are within limits

I'm using Solr 6.6. Here are the details about the node:

Hardware type: AWS m4.4xlarge instance
Total memory : 64 gb
CPU : 16
SSD
SOLR_JAVA_MEM="-Xms35g -Xmx35g"
GC_TUNE="-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts"
SOLR_OPTS="$SOLR_OPTS -Xss256k"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=60"
SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=12"

Cache Parameters:
4096
1000







true
60

I currently have 2 shards, each with 2 replicas. The index size is
approximately 70gb.

Here's a Solr log trace from the series of events once the node starts
getting into trouble. I've posted only the relevant entries here.


org.apache.solr.common.SolrException.log(SolrException.java:148) -
org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates
are disabled.
at
org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1738)

 
org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$5(CoreAdminOperation.java:143)
- It has been requested that we recover: core=knowledge
INFO647718[qtp2039328061-1526] -
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:732)
- [admin] webapp=null path=/admin/cores
params={core=knowledge&action=REQUESTRECOVERY&wt=javabin&version=2}
status=0 QTime=0
INFO647808[qtp2039328061-1540] -
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:187)
- [knowledge]  webapp=/solr path=/update
params={update.distrib=FROMLEADER&distrib.from=
http://xx.xxx.xxx.63:8983/solr/knowledge/&wt=javabin&version=2}{} 0 0

WARN657500[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659)
- Socket timeout on send prep recovery cmd, retrying..
INFO657500[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676)
- Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr];
[WaitForState:
action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
WARN667514[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659)
- Socket timeout on send prep recovery cmd, retrying..

The retry happens a few times, then

INFO689389[qtp2039328061-1649] -
org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147)
- request has come without principal. failed permission {
  "name":"select",
  "collection":"knowledge",
  "path":"/select",
  "role":[
"admin",
"dev",
"read"],
  "index":3}
INFO689390[qtp2039328061-1649] -
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
USER_REQUIRED auth header null context : userPrincipal: [null] type:
[READ], collections: [knowledge,], Path: [/select] path : /select params
:q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
INFO907854[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
x:knowledge s:shard2 c:knowledge r:core_node9] -
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676)
- Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr];
[WaitForState:
action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
INFO913811[commitScheduler-13-thread-1] -
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:603)
- start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO913812[commitScheduler-13-thread-1] -
org.apache.solr.update.SolrIndexWriter.setCommitData(SolrIndexWriter.java:174)
- Calling setCommitData with
IW:org.apache.solr.update.SolrIndexWriter@29f7c7fc
 INFO913812[commitScheduler-13-thread-1] -
org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
-