Hi Sravan,
Glad to hear it helped!

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> On 4 Jan 2018, at 13:36, Sravan Kumar <sra...@caavo.com> wrote:
>
> Emir,
> 'delete_by_query' is the cause of the replicas going into recovery state.
> I replaced it with delete_by_id as you suggested. Everything works fine
> after that. The cluster held up for nearly 3 hours without any failures.
> Thanks, Emir.
>
>
> On Wed, Jan 3, 2018 at 8:41 PM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
>
>> Hi Sravan,
>> DBQ does not play well with indexing - it causes indexing to be
>> completely blocked on replicas while it is running. It is highly likely
>> that it is the root cause of your issues. If you can change your
>> indexing logic to avoid it, you can quickly test this. As a workaround,
>> you can query for the IDs that need to be deleted and execute a bulk
>> delete by ID - that will not cause the issues that DBQ does.
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 3 Jan 2018, at 16:04, Sravan Kumar <sra...@caavo.com> wrote:
>>>
>>> Emir,
>>> Yes, there is a delete_by_query on every bulk insert.
>>> This delete_by_query deletes all documents whose update time is
>>> earlier than one day before the current time.
>>> Is bulk delete_by_query the reason?
>>>
>>> On Wed, Jan 3, 2018 at 7:58 PM, Emir Arnautović <
>>> emir.arnauto...@sematext.com> wrote:
>>>
>>>> Do you have deletes by query while indexing, or is it an append-only
>>>> index?
>>>>
>>>> Regards,
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>
>>>>
>>>>
>>>>> On 3 Jan 2018, at 12:16, sravan <sra...@caavo.com> wrote:
>>>>>
>>>>> SolrCloud nodes going into recovery state during indexing
>>>>>
>>>>>
>>>>> We have a SolrCloud setup with the settings shared below. We have a
>>>>> collection with 3 shards and a replica for each of them.
>>>>>
>>>>> Normal state (as soon as the whole cluster is restarted):
>>>>> - Status of all the shards is UP.
>>>>> - A bulk update request of 50 documents takes < 100 ms.
>>>>> - 6-10 simultaneous bulk updates.
>>>>>
>>>>> Nodes go into recovery state after 15-30 minutes of updates:
>>>>> - Some shards start giving the following ERRORs:
>>>>>   - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.
>>>>>     DistributedUpdateProcessor$DistributedUpdatesAsyncException:
>>>>>     Async exception during distributed update: Read timed out
>>>>>   - o.a.s.u.StreamingSolrClients error
>>>>>     java.net.SocketTimeoutException: Read timed out
>>>>> - The following error is seen on the shard that goes into recovery
>>>>>   state:
>>>>>   - too many updates received since start - startingUpdates no
>>>>>     longer overlaps with our currentUpdates
>>>>> - Sometimes the same shard even goes into DOWN state and needs a
>>>>>   node restart to come back.
>>>>> - A bulk update request of 50 documents takes more than 5 seconds,
>>>>>   sometimes even > 120 seconds. This is seen for all requests if at
>>>>>   least one node in the whole cluster is in recovery state.
>>>>>
>>>>> We have a standalone setup with the same collection schema which is
>>>>> able to take the update & query load without any errors.
>>>>>
>>>>>
>>>>> We have the following SolrCloud setup:
>>>>> - Setup in AWS.
>>>>>
>>>>> - ZooKeeper setup:
>>>>>   - number of nodes: 3
>>>>>   - AWS instance type: t2.small
>>>>>   - instance memory: 2 GB
>>>>>
>>>>> - Solr setup:
>>>>>   - Solr version: 6.6.0
>>>>>   - number of nodes: 3
>>>>>   - AWS instance type: m5.xlarge
>>>>>   - instance memory: 16 GB
>>>>>   - number of cores: 4
>>>>>   - Java heap: 8 GB
>>>>>   - Java version: Oracle Java "1.8.0_151"
>>>>>   - GC settings: default CMS.
>>>>>
>>>>> Collection settings:
>>>>> - number of shards: 3
>>>>> - replication factor: 2
>>>>> - total 6 replicas.
>>>>> - total number of documents in the collection: 12 million
>>>>> - total number of documents in each shard: 4 million
>>>>> - Each document has around 25 fields, with 12 of them having
>>>>>   textual analyzers & filters.
>>>>> - Commit strategy:
>>>>>   - No explicit commits from application code.
>>>>>   - Hard commit of 15 secs with openSearcher as false.
>>>>>   - Soft commit of 10 mins.
>>>>> - Cache strategy:
>>>>>   - filter queries
>>>>>     - size: 512
>>>>>     - autowarmCount: 100
>>>>>   - all other caches
>>>>>     - size: 512
>>>>>     - autowarmCount: 0
>>>>> - maxWarmingSearchers: 2
>>>>>
>>>>>
>>>>> We tried the following:
>>>>> - commit strategy:
>>>>>   - hard commit: 150 secs
>>>>>   - soft commit: 5 mins
>>>>> - with the G1 garbage collector, based on
>>>>>   https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr:
>>>>>   - the nodes go into recovery state in less than a minute.
>>>>>
>>>>> The issue is seen even when the leaders are balanced across the
>>>>> three nodes.
>>>>>
>>>>> Can you help us find the solution to this problem?
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Sravan
>>
>>
>
>
> --
> Regards,
> Sravan
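For reference, the commit strategy described in the thread (hard commit every 15 s without opening a searcher, soft commit every 10 min) would correspond to a solrconfig.xml fragment roughly like this - a sketch, not the poster's actual config:

```xml
<!-- Sketch matching the thread's commit strategy -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>          <!-- hard commit every 15 s -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>600000</maxTime>         <!-- soft commit every 10 min -->
  </autoSoftCommit>
</updateHandler>
```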
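The workaround Emir suggests - query for the IDs that match the delete criterion, then issue a bulk delete by ID instead of a delete-by-query - can be sketched roughly as below. The collection URL, the `updated_at` field name, and the batch size are assumptions for illustration, not details from the thread:

```python
# Sketch of the delete-by-ID workaround: instead of one delete_by_query,
# first fetch only the IDs of documents older than a day, then send a
# bulk delete-by-ID update. SOLR URL and field name are assumed.
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr/mycollection"  # assumed collection URL

def build_id_query(cutoff_field="updated_at", batch=500):
    """Query string that fetches only the IDs of stale documents."""
    return urllib.parse.urlencode({
        "q": f"{cutoff_field}:[* TO NOW-1DAY]",  # older than one day
        "fl": "id",      # return IDs only, not whole documents
        "rows": batch,
        "wt": "json",
    })

def build_delete_payload(ids):
    """JSON body for a bulk delete-by-ID update request."""
    return json.dumps({"delete": list(ids)})

def delete_stale(cutoff_field="updated_at", batch=500):
    """Loop: fetch a page of stale IDs, delete them by ID, repeat."""
    while True:
        url = f"{SOLR}/select?{build_id_query(cutoff_field, batch)}"
        with urllib.request.urlopen(url) as resp:
            docs = json.load(resp)["response"]["docs"]
        if not docs:
            break
        body = build_delete_payload(d["id"] for d in docs)
        req = urllib.request.Request(
            f"{SOLR}/update",
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).close()
```

Unlike DBQ, each delete-by-ID batch is an ordinary update, so replicas are not blocked while it runs; the commit is left to the autoCommit settings already in place.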