Re: Solr nodes going into recovery mode and eventually failing
Hi Shamik,

Please see inline comments/questions.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 24 Oct 2017, at 07:41, shamik wrote:
>
> Thanks Emir and Zisis.
>
> I added the maxRamMB for filterCache and reduced the size. I could see the
> benefit immediately, the hit ratio went to 0.97. Here's the configuration:
>
> autowarmCount="128" maxRamMB="500" />
> autowarmCount="128" />
> autowarmCount="0" />

[EA] Based on what you mentioned earlier, not all your filters are "cache friendly" and the hit rate depends on your clients' usage (or maybe on how many concurrent clients you have) - in other words this hit ratio could be a false positive. One explanation is that you previously had a count limit and now a memory limit, that the memory limit ignores the count (would need to check this), and that your cache items are on average smaller than estimated so more than 4K of them fit into 500MB - but based on your query rate and commit interval, I am not sure you needed more than 4K.

> It seemed to be stable for a few days, the cache hits and jvm pool utilization
> seemed to be well within the expected range. But the OOM issue occurred on one
> of the nodes as the heap size reached 30gb. The hit ratio for the query result
> cache and document cache at that point was recorded as 0.18 and 0.65. I'm
> not sure if the cache caused the memory spike at this point; with the filter
> cache restricted to 500mb, it should be negligible. One thing I noticed is
> that the eviction rate now (with the addition of maxRamMB) is staying at 0.

[EA] Did you see evictions before? With 400 rq/h and 10 min commit intervals, you get 60-70 requests between two commits. With a 4K cache size, each request would have to create more than 60 cache entries in order to start evicting.

> Index hard commit happens every 10 min, that's when the cache gets
> flushed. Based on the monitoring log, the spike happened on the indexing
> side, where almost 8k docs went into a pending state.
>
> From a query performance standpoint, there have been occasional slow queries
> (1sec+), but nothing alarming so far. Same goes for deep paging, I haven't
> seen any evidence which points to that.
>
> Based on the hit ratio, I can further scale down the query result and
> document caches, also change them to FastLRUCache and add maxRamMB. For the filter
> cache, I think this setting should be optimal enough to work on a 30gb heap
> unless I'm wrong about the maxRamMB concept. I'll have to get a heap dump
> somehow; unfortunately, the whole process (of the node going down) happens
> so quickly that I've hardly any time to run a profiler.

[EA] You did not mention whether you ruled out fieldCache and fieldValueCache. I don't have much experience with LTR, but I see that there is another cache related to it. Do you use it? Could it be the component that consumes memory?

> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
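Emir's eviction arithmetic, spelled out (a rough back-of-the-envelope that assumes the 400 rq/h rate and 10 min hard-commit interval quoted above, and that every entry survives until the next commit clears the cache):

    400 rq/h / 6                ≈ 66 requests per 10-minute commit window
    4096 cache entries / 66 rq  ≈ 62 new filterCache entries per request
                                  before any eviction could occur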
Re: Solr nodes going into recovery mode and eventually failing
Thanks Emir and Zisis.

I added the maxRamMB for filterCache and reduced the size. I could see the benefit immediately, the hit ratio went to 0.97. Here's the configuration:

It seemed to be stable for a few days, the cache hits and jvm pool utilization seemed to be well within the expected range. But the OOM issue occurred on one of the nodes as the heap size reached 30gb. The hit ratio for the query result cache and document cache at that point was recorded as 0.18 and 0.65. I'm not sure if the cache caused the memory spike at this point; with the filter cache restricted to 500mb, it should be negligible. One thing I noticed is that the eviction rate now (with the addition of maxRamMB) is staying at 0. Index hard commit happens every 10 min, that's when the cache gets flushed. Based on the monitoring log, the spike happened on the indexing side, where almost 8k docs went into a pending state.

From a query performance standpoint, there have been occasional slow queries (1sec+), but nothing alarming so far. Same goes for deep paging, I haven't seen any evidence which points to that.

Based on the hit ratio, I can further scale down the query result and document caches, also change them to FastLRUCache and add maxRamMB. For the filter cache, I think this setting should be optimal enough to work on a 30gb heap unless I'm wrong about the maxRamMB concept. I'll have to get a heap dump somehow; unfortunately, the whole process (of the node going down) happens so quickly that I've hardly any time to run a profiler.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
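An illustrative reading of the configuration referred to above, based only on the attributes visible in Emir's quote of this message (maxRamMB="500" and the autowarmCount values). The element names of the second and third entries and every size value are guesses, not Shamik's actual settings:

    <filterCache class="solr.FastLRUCache" size="4096" autowarmCount="128" maxRamMB="500"/>  <!-- size is a placeholder -->
    <queryResultCache class="solr.LRUCache" size="512" autowarmCount="128"/>                 <!-- element and size are guesses -->
    <documentCache class="solr.LRUCache" size="512" autowarmCount="0"/>                      <!-- element and size are guesses -->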
Re: Solr nodes going into recovery mode and eventually failing
You mentioned that you are on v6.6, but in case someone else uses this, just to add that maxRamMB was added to FastLRUCache in version 6.4.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 23 Oct 2017, at 14:35, Zisis T. wrote:
>
> shamik wrote
>> I was not aware of maxRamMB parameter, looks like it's only available for
>> queryResultCache. Is that what you are referring to? Can you please share
>> your cache configuration?
>
> I've set up the filterCache entry inside solrconfig.xml as follows:
>
> autowarmCount="0" maxRamMB="120"/>
>
> I had a look inside the FastLRUCache code and saw that maxRamMB has precedence
> over the size. I can also confirm that I had more than 512 entries inside
> the cache, so the above will work.
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr nodes going into recovery mode and eventually failing
shamik wrote
> I was not aware of maxRamMB parameter, looks like it's only available for
> queryResultCache. Is that what you are referring to? Can you please share
> your cache configuration?

I've set up the filterCache entry inside solrconfig.xml as follows:

autowarmCount="0" maxRamMB="120"/>

I had a look inside the FastLRUCache code and saw that maxRamMB has precedence over the size. I can also confirm that I had more than 512 entries inside the cache, so the above will work.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
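Putting together the pieces Zisis gives elsewhere in the thread (FastLRUCache, the default size of 512 he mentions, autowarmCount="0", maxRamMB="120"), the entry presumably looked roughly like the following - treat it as a sketch, not his exact solrconfig.xml line:

    <filterCache class="solr.FastLRUCache" size="512" autowarmCount="0" maxRamMB="120"/>
    <!-- per his reading of the FastLRUCache code, maxRamMB takes precedence over size -->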
Re: Solr nodes going into recovery mode and eventually failing
Hi Shamik,

I agree that your filter cache is not the reason for OOMs. Can you confirm that your fieldCache and fieldValueCache sizes are not consuming too much memory? The next thing on the list would be some heavy faceting with pivots, but you mentioned that all fields are low cardinality. Do you see any extremely slow queries in your logs? Can you check if there are some deep paging queries? If nothing else, you can always do a heap dump and see what's in it.

And about your filterCache hit ratio: how frequently do you commit? With 400 rq/h it can be that filters are not repeated between two commits. Do you have a high eviction rate?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 20 Oct 2017, at 20:10, shamik wrote:
>
> Zisis, thanks for chiming in. This is really interesting information and
> probably in line with what I'm trying to fix. In my case, the facet fields are
> certainly not high-cardinality ones. Most of them have a finite set of values,
> the max being 200 (though it has a low usage percentage). Earlier I had
> facet.limit=-1, but then scaled it down to 200 to eliminate any performance
> overhead.
>
> I was not aware of the maxRamMB parameter, looks like it's only available for
> queryResultCache. Is that what you are referring to? Can you please share
> your cache configuration?
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
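A hedged sketch of how one might check what Emir asks about on a Solr 6.x node. The host, port, core name ("knowledge", taken from the logs later in the thread), and dump path are assumptions:

    # Per-core cache statistics (lookups, hits, evictions, size), including
    # fieldCache and fieldValueCache, via the mbeans handler:
    curl "http://localhost:8983/solr/knowledge/admin/mbeans?stats=true&cat=CACHE&wt=json"

    # Heap dump for offline analysis (e.g. in Eclipse MAT); note the JVM pauses
    # while the dump is written:
    jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <solr-pid>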
Re: Solr nodes going into recovery mode and eventually failing
Zisis, thanks for chiming in. This is really interesting information and probably in line with what I'm trying to fix. In my case, the facet fields are certainly not high-cardinality ones. Most of them have a finite set of values, the max being 200 (though it has a low usage percentage). Earlier I had facet.limit=-1, but then scaled it down to 200 to eliminate any performance overhead.

I was not aware of the maxRamMB parameter, looks like it's only available for queryResultCache. Is that what you are referring to? Can you please share your cache configuration?

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
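For reference, the facet.limit change mentioned above is a plain request (or request-handler default) parameter; the field name below is only an illustrative placeholder, and -1 means no limit on returned facet buckets:

    before: facet=true&facet.field=Source2&facet.limit=-1
    after:  facet=true&facet.field=Source2&facet.limit=200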
Re: Solr nodes going into recovery mode and eventually failing
Thanks Erick, in my case, each replica is running in its own JVM, so even if we assume 8gb of filter cache, that still leaves 27gb to play with. Isn't that a decent amount of memory to handle the rest of the JVM's work? Here's an example of the implicit filters that get applied to almost all the queries. Except for Source2 and AccessMode, the rest of the fields have docValues enabled. Our sorting is done mostly on relevance, so there's little impact there.

fq=language:("english")
fq=ContentGroup:("Learn & Explore" OR "Getting Started" OR "Troubleshooting" OR "Downloads")
fq=Source2:("Help" OR "documentation" OR "video" OR (+Source2:"discussion" +Solution:"yes") OR "sfdcarticles" OR "downloads" OR "topicarticles" OR "screen" OR "blog" OR "simplecontent" OR "auonline" OR "contributedlink" OR "collection") AND -workflowparentid:[*+TO+*] AND -AccessMode:"internal" AND -AccessMode:"beta" AND -DisplayName:Partner AND -GlobalDedup:true AND -Exclude:"knowledge" AND -Exclude:"all"
bf=recip(ms(NOW/DAY,PublishDate),3.16e-11,1,1)^1.0

As you can see, there's a bunch, so the filter cache is rather important for us from a performance standpoint. The hit ratio of 25% is abysmal, and I don't think there are too many unique queries contributing to this. As I mentioned earlier, increasing the size parameter does improve the hit count. Just wondering, what are the best practices around scenarios like this? Looks like I've pretty much exhausted my options :-).

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr nodes going into recovery mode and eventually failing
I'll post my experience too, as I believe it might be related to the low FilterCache hit ratio issue. Please let me know if you think I'm off topic here and should create a separate thread.

I've run search stress tests on two different Solr 6.5.1 installations, sending distributed search queries with facets (using facet.field; the faceted fields have docValues="true").

1-shard: 1 shard of ~200GB/8M docs. FilterCache with the default 512 entries. 90% hit ratio.

2-shards: 2 shards of ~100GB/4M docs each. FilterCache with the default 512 entries. 10% hit ratio. Huge drop in search performance.

I noticed that the majority of the FilterCache entries look like "filters on facet terms", and instead of a FixedBitSet whose size is equal to the number of docs in the shard, each entry contains an int[] of the matching docids:

FacetField1:Term1 -> int[] of matching docids
FacetField1:Term2 -> int[] of matching docids
FacetField2:Term3 -> int[] of matching docids
FacetField2:Term4 -> int[] of matching docids

Given that FacetField1 and FacetField2 are high-cardinality fields, there are too many keys in the cache, but with few matching documents in most cases. Since the cache values therefore do not need much memory, I ended up using maxRamMB=120, which in my case gives a ~80% hit ratio, allowing more entries in the cache and better control over consumed memory.

This has been previously discussed here too:
http://lucene.472066.n3.nabble.com/Filter-cache-pollution-during-sharded-edismax-queries-td4074867.html#a4162151

Is this "overuse" of FilterCache normal in distributed search?

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
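A rough sense of the entry sizes behind that observation, using the ~4M docs per shard figure above (approximate, and assuming a plain FixedBitSet versus a small int[] of docids):

    FixedBitSet over 4M docs:             4,000,000 / 8 bytes ≈ 500 KB per entry
    512 such entries:                     512 x 500 KB        ≈ 250 MB
    int[] of, say, 1,000 matching docids: 1,000 x 4 bytes     ≈ 4 KB per entry

which is why capping by maxRamMB admits far more of the small "facet term" entries than the fixed count of 512 would, and why the hit ratio recovered.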
Re: Solr nodes going into recovery mode and eventually failing
Once you hit an OOM, the behavior of Java is indeterminate. There's no expectation that things will just pick up where they left off when memory is freed up. Lots of production systems have OOM killer scripts that automatically kill/restart Java apps that OOM for just that reason.

Yes, each replica has its own cache, but the JVM heap is used by them all. That's why "times the number of replicas". Perhaps a more complete statement would be "times the number of replicas hosted in the JVM".

Hmmm, 11M docs. Let's take 16M; that would give 2M bytes per filterCache entry. Times 4096 gives around 8G that could be used up by a cache that size. Yeah, your hit ratio is poor at 15%. It's relatively unusual to require that many entries though; what do the fq clauses look like? Or are you using something else that consumes cache (some facet methods do, for instance)? And do be sure to use docValues for any field you facet, sort or group on.

Best,
Erick

On Thu, Oct 19, 2017 at 2:24 PM, shamik wrote:
> Thanks Emir. The index is equally split between the two shards, each having
> approx 35gb. The total number of documents is around 11 million, which should
> be distributed equally between the two shards. So, each core should take 3gb
> of the heap for a full cache. Not sure I get the "multiply it by number of
> replicas". Shouldn't each replica have its own cache of 3gb? Moreover, based
> on the SPM graph, the max filter cache size during the outages has been around
> 1.5 million.
>
> The majority of our queries are heavily dependent on some implicit filters and
> user-selected ones. Reducing the filter cache size to the current one of
> 4096 has taken a hit in performance. Earlier (in 5.5), I had a max cache
> size of 10,000 (running on a 15gb allocated heap) which produced a 95% hit
> rate. With the memory issues in 6.6, I started reducing it to the current
> value. It reduced the hit % to 25. I tried earlier reducing the value further,
> with autowarmCount="0". It still didn't help, which is when I decided to go
> for a higher-RAM machine. What I've noticed is that the heap is consistently
> around the 22-23gb mark, out of which G1 old gen takes close to 13gb, G1 eden
> space around 6gb, and the rest is shared by G1 survivor space, Metaspace and
> code cache.
>
> This issue has been bothering me as I seem to be running out of possible
> tuning options. What I could see from the monitoring tool is that the surge
> period saw around 400 requests/hr with 40 docs/sec getting indexed. Is that a
> really high volume of load to handle for a cluster of 6 nodes with 16 CPUs
> / 64gb RAM each? What other options should I be looking into?
>
> The other thing I'm still confused about is why the recovery fails when the
> memory has been freed up.
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
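Erick's estimate, written out (a back-of-the-envelope that assumes one bit per document per filterCache entry and rounds the 11M docs up to 16M):

    16,000,000 docs / 8 bits per byte ≈ 2 MB per filterCache entry
    2 MB x 4096 entries               ≈ 8 GB for a completely full cache
    (per replica, so multiply by the number of replicas hosted in the JVM)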
Re: Solr nodes going into recovery mode and eventually failing
Thanks Emir. The index is equally split between the two shards, each having approx 35gb. The total number of documents is around 11 million, which should be distributed equally between the two shards. So, each core should take 3gb of the heap for a full cache. Not sure I get the "multiply it by number of replicas". Shouldn't each replica have its own cache of 3gb? Moreover, based on the SPM graph, the max filter cache size during the outages has been around 1.5 million.

The majority of our queries are heavily dependent on some implicit filters and user-selected ones. Reducing the filter cache size to the current one of 4096 has taken a hit in performance. Earlier (in 5.5), I had a max cache size of 10,000 (running on a 15gb allocated heap) which produced a 95% hit rate. With the memory issues in 6.6, I started reducing it to the current value. It reduced the hit % to 25. I tried earlier reducing the value further, with autowarmCount="0". It still didn't help, which is when I decided to go for a higher-RAM machine. What I've noticed is that the heap is consistently around the 22-23gb mark, out of which G1 old gen takes close to 13gb, G1 eden space around 6gb, and the rest is shared by G1 survivor space, Metaspace and code cache.

This issue has been bothering me as I seem to be running out of possible tuning options. What I could see from the monitoring tool is that the surge period saw around 400 requests/hr with 40 docs/sec getting indexed. Is that a really high volume of load to handle for a cluster of 6 nodes with 16 CPUs / 64gb RAM each? What other options should I be looking into?

The other thing I'm still confused about is why the recovery fails when the memory has been freed up.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr nodes going into recovery mode and eventually failing
Hi Shamik,

I am pleased to see you find SPM useful! I think that your problems might be related to caches exhausting your memory. You mentioned that your index is 70GB, but how many documents does it have? Remember that filter caches can take up to 1 bit/doc per entry. With a filter cache size of 4096 that means a full cache can take up to 0.5GB for 1 million documents (it is per core, so multiply it by the number of replicas). If you have some fields without doc values, check the sizes of fieldCaches and fieldValueCaches as well.

The other thing that you should revisit is heap size - heap sizes over 32GB prevent the JVM from using compressed OOPs, so a heap just over 32GB actually fits fewer objects than a 31GB heap (the boundary depends on the JVM, so it is not exactly 32GB). A sketch of adjusted settings follows the quoted message below.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 19 Oct 2017, at 01:12, Shamik Bandopadhyay wrote:
>
> Hi,
>
> I'm having this weird issue where Solr nodes suddenly go into recovery
> mode and eventually fail. That one failure kicks off a cascading effect
> and impacts the other nodes eventually. Without a restart, the entire
> cluster goes into a limbo after a while. Looking into the log and the SPM
> monitoring tool, the issue happens under the following circumstances:
> 1. The node gets a spike in query/index requests, thus exhausting its allocated
> memory.
> 2. GC forces the CPU to use 100% of its capacity.
> 3. None of the above, when both JVM and CPU are within limits.
>
> I'm using Solr 6.6. Here are the details about the node:
>
> Hardware type: AWS m4.4xlarge instance
> Total memory: 64 gb
> CPU: 16
> SSD
> SOLR_JAVA_MEM="-Xms35g -Xmx35g"
> GC_TUNE="-XX:+UseG1GC \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=200 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts"
> SOLR_OPTS="$SOLR_OPTS -Xss256k"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=60"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=12"
>
> Cache Parameters:
> 4096
> 1000
> autowarmCount="20"/>
> autowarmCount="10"/>
> autowarmCount="10"/>
> initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
> showItems="10" />
> class="solr.search.LRUCache" size="4096" initialSize="2048"
> autowarmCount="4096" regenerator="solr.search.NoOpRegenerator" />
> true
> 60
>
> I've currently got 2 shards, each having 2 replicas. The index size is
> approximately 70gb.
>
> Here's a solr log trace from the series of events once the node starts
> getting into trouble. I've posted only the relevant ones here.
>
> org.apache.solr.common.SolrException.log(SolrException.java:148) -
> org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates
> are disabled.
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1738)
>
> org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$5(CoreAdminOperation.java:143)
> - It has been requested that we recover: core=knowledge
> INFO 647718 [qtp2039328061-1526] -
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:732)
> - [admin] webapp=null path=/admin/cores
> params={core=knowledge&action=REQUESTRECOVERY&wt=javabin&version=2}
> status=0 QTime=0
> INFO 647808 [qtp2039328061-1540] -
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:187)
> - [knowledge] webapp=/solr path=/update
> params={update.distrib=FROMLEADER&distrib.from=
> http://xx.xxx.xxx.63:8983/solr/knowledge/&wt=javabin&version=2}{} 0 0
>
> WARN 657500 [recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659)
> - Socket timeout on send prep recovery cmd, retrying..
> INFO 657500 [recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676)
> - Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr];
> [WaitForState:
> action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
> WARN 667514 [recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659)
> - Socket timeout on send prep recovery cmd, retrying..
>
> The retry happens a few times, then
>
> INFO 689389 [qtp2039328061-1649] -
> org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147)
> - request has come without principal. failed permission {
> "name":"select",
> "collection"
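The heap sketch referenced in Emir's note above - a minimal illustration of his compressed-OOPs point for solr.in.sh, assuming the GC settings stay as quoted in the original post; 31g is an example value and the exact threshold depends on the JVM:

    # Keep the heap below the compressed-OOPs cutoff (~32GB): a 31g heap with
    # compressed OOPs can be as effective as a noticeably larger heap without them.
    SOLR_JAVA_MEM="-Xms31g -Xmx31g"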
Solr nodes going into recovery mode and eventually failing
Hi,

I'm having this weird issue where Solr nodes suddenly go into recovery mode and eventually fail. That one failure kicks off a cascading effect and impacts the other nodes eventually. Without a restart, the entire cluster goes into a limbo after a while. Looking into the log and the SPM monitoring tool, the issue happens under the following circumstances:

1. The node gets a spike in query/index requests, thus exhausting its allocated memory.
2. GC forces the CPU to use 100% of its capacity.
3. None of the above, when both JVM and CPU are within limits.

I'm using Solr 6.6. Here are the details about the node:

Hardware type: AWS m4.4xlarge instance
Total memory: 64 gb
CPU: 16
SSD
SOLR_JAVA_MEM="-Xms35g -Xmx35g"
GC_TUNE="-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=8m \
-XX:MaxGCPauseMillis=200 \
-XX:+UseLargePages \
-XX:+AggressiveOpts"
SOLR_OPTS="$SOLR_OPTS -Xss256k"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=60"
SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=12"

Cache Parameters:
4096
1000
autowarmCount="20"/>
autowarmCount="10"/>
autowarmCount="10"/>
initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
showItems="10" />
class="solr.search.LRUCache" size="4096" initialSize="2048" autowarmCount="4096" regenerator="solr.search.NoOpRegenerator" />
true
60

I've currently got 2 shards, each having 2 replicas. The index size is approximately 70gb.

Here's a solr log trace from the series of events once the node starts getting into trouble. I've posted only the relevant ones here.

org.apache.solr.common.SolrException.log(SolrException.java:148) -
org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
    at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1738)

org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$5(CoreAdminOperation.java:143) - It has been requested that we recover: core=knowledge
INFO 647718 [qtp2039328061-1526] - org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:732) - [admin] webapp=null path=/admin/cores params={core=knowledge&action=REQUESTRECOVERY&wt=javabin&version=2} status=0 QTime=0
INFO 647808 [qtp2039328061-1540] - org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:187) - [knowledge] webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from=http://xx.xxx.xxx.63:8983/solr/knowledge/&wt=javabin&version=2}{} 0 0

WARN 657500 [recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659) - Socket timeout on send prep recovery cmd, retrying..
INFO 657500 [recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676) - Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr]; [WaitForState: action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
WARN 667514 [recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659) - Socket timeout on send prep recovery cmd, retrying..

The retry happens a few times, then

INFO 689389 [qtp2039328061-1649] - org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147) - request has come without principal.
failed permission {
  "name":"select",
  "collection":"knowledge",
  "path":"/select",
  "role":[
    "admin",
    "dev",
    "read"],
  "index":3}
INFO 689390 [qtp2039328061-1649] - org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) - USER_REQUIRED auth header null context : userPrincipal: [null] type: [READ], collections: [knowledge,], Path: [/select] path : /select params : q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
INFO 907854 [recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr x:knowledge s:shard2 c:knowledge r:core_node9] - org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676) - Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr]; [WaitForState: action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
INFO 913811 [commitScheduler-13-thread-1] - org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:603) - start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO 913812 [commitScheduler-13-thread-1] - org.apache.solr.update.SolrIndexWriter.setCommitData(SolrIndexWriter.java:174) - Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter@29f7c7fc
INFO 913812 [commitScheduler-13-thread-1] - org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34) -