Hi Sravan,
Glad to hear it helped!

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 4 Jan 2018, at 13:36, Sravan Kumar <sra...@caavo.com> wrote:
> 
> Emir,
>   'delete_by_query' was the cause of the replicas going into recovery state.
>   I replaced it with delete_by_id as you suggested. Everything works fine
> after that. The cluster held up for nearly 3 hours without any failures.
>   Thanks, Emir.
> 
> 
> On Wed, Jan 3, 2018 at 8:41 PM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Sravan,
>> DBQ does not play well with indexing - it causes indexing to be completely
>> blocked on replicas while it is running. It is highly likely the root
>> cause of your issues. If you can change your indexing logic to avoid it,
>> you can quickly test this. As a workaround, you can query for the IDs that
>> need to be deleted and execute a bulk delete by ID - that will not cause
>> the same issues as DBQ.
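A minimal sketch of that workaround in Python, using only the standard library. The base URL, collection name, and the "updated_at" field are assumptions for illustration, not part of the original setup:

```python
# Sketch of the suggested workaround: collect the matching IDs first, then
# issue one bulk delete-by-id request instead of a delete-by-query.
# Base URL, collection name, and the "updated_at" field are assumptions.
import json
import urllib.request
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr/mycollection"  # assumed

def build_delete_payload(ids):
    """JSON update body that deletes the given documents by ID in one request."""
    return {"delete": list(ids)}

def delete_older_than(cutoff="NOW-1DAY", rows=500):
    # Fetch only the IDs of documents older than the cutoff.
    params = urlencode({"q": f"updated_at:[* TO {cutoff}]",
                        "fl": "id", "rows": rows, "wt": "json"})
    with urllib.request.urlopen(f"{SOLR}/select?{params}") as resp:
        docs = json.load(resp)["response"]["docs"]
    if docs:
        body = json.dumps(build_delete_payload(d["id"] for d in docs)).encode()
        req = urllib.request.Request(f"{SOLR}/update", data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)
```

This keeps deletes as plain by-ID operations, which replicas can apply without the reordering checks that make DBQ block indexing.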
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 3 Jan 2018, at 16:04, Sravan Kumar <sra...@caavo.com> wrote:
>>> 
>>> Emir,
>>>   Yes, there is a delete_by_query on every bulk insert.
>>>   This delete_by_query deletes all documents whose update timestamp is
>>> earlier than one day before the current time.
>>>   Is bulk delete_by_query the reason?
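For comparison, the two kinds of delete request look like this in Solr's JSON update format (the "updated_at" field name is an assumption); per the advice above, only the first blocks indexing on replicas while it runs:

```python
# Delete-by-query: one request, but it blocks indexing on replicas while
# it runs. The "updated_at" field name is an assumption for illustration.
dbq_payload = {"delete": {"query": "updated_at:[* TO NOW-1DAY]"}}

# Bulk delete-by-id: the same documents removed by explicit IDs, which
# does not block replica indexing the same way.
dbid_payload = {"delete": ["doc-1", "doc-2", "doc-3"]}
```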
>>> 
>>> On Wed, Jan 3, 2018 at 7:58 PM, Emir Arnautović <
>>> emir.arnauto...@sematext.com> wrote:
>>> 
>>>> Do you have deletes by query while indexing, or is it an append-only index?
>>>> 
>>>> Regards,
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>> 
>>>> 
>>>> 
>>>>> On 3 Jan 2018, at 12:16, sravan <sra...@caavo.com> wrote:
>>>>> 
>>>>> SolrCloud Nodes going to recovery state during indexing
>>>>> 
>>>>> 
>>>>> We have a SolrCloud setup with the settings shared below. We have a
>>>> collection with 3 shards and a replica for each of them.
>>>>> 
>>>>> Normal State(As soon as the whole cluster is restarted):
>>>>>   - Status of all the shards is UP.
>>>>>   - a bulk update request of 50 documents each takes < 100ms.
>>>>>   - 6-10 simultaneous bulk updates.
>>>>> 
>>>>> Nodes go to recovery state after 15-30 mins of updates:
>>>>>   - Some shards start giving the following ERRORs:
>>>>>       - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Read timed out
>>>>>       - o.a.s.u.StreamingSolrClients error java.net.SocketTimeoutException: Read timed out
>>>>>   - the following error is seen on the shard which goes to recovery state:
>>>>>       - too many updates received since start - startingUpdates no longer overlaps with our currentUpdates.
>>>>>   - Sometimes the same shard even goes to DOWN state and needs a node restart to come back.
>>>>>   - a bulk update request of 50 documents takes more than 5 seconds, sometimes even >120 secs. This is seen for all requests if at least one node in the whole cluster is in recovery state.
>>>>> 
>>>>> We have a standalone setup with the same collection schema which is able to take the update & query load without any errors.
>>>>> 
>>>>> 
>>>>> We have the following solrcloud setup.
>>>>>   - setup in AWS.
>>>>> 
>>>>>   - Zookeeper Setup:
>>>>>       - number of nodes: 3
>>>>>       - aws instance type: t2.small
>>>>>       - instance memory: 2gb
>>>>> 
>>>>>   - Solr Setup:
>>>>>       - Solr version: 6.6.0
>>>>>       - number of nodes: 3
>>>>>       - aws instance type: m5.xlarge
>>>>>       - instance memory: 16gb
>>>>>       - number of cores: 4
>>>>>       - JAVA HEAP: 8gb
>>>>>       - JAVA VERSION: oracle java version "1.8.0_151"
>>>>>       - GC settings: default CMS.
>>>>> 
>>>>>       collection settings:
>>>>>           - number of shards: 3
>>>>>           - replication factor: 2
>>>>>           - total 6 replicas.
>>>>>           - total number of documents in the collection: 12 million
>>>>>           - total number of documents in each shard: 4 million
>>>>>           - Each document has around 25 fields with 12 of them
>>>> containing textual analysers & filters.
>>>>>           - Commit Strategy:
>>>>>               - No explicit commits from application code.
>>>>>               - Hard commit of 15 secs with OpenSearcher as false.
>>>>>               - Soft commit of 10 mins.
>>>>>           - Cache Strategy:
>>>>>               - filter queries
>>>>>                   - number: 512
>>>>>                   - autowarmCount: 100
>>>>>               - all other caches
>>>>>                   - number: 512
>>>>>                   - autowarmCount: 0
>>>>>           - maxWarmingSearchers: 2
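The commit strategy above corresponds to roughly this solrconfig.xml fragment (a sketch built from the values in the post, not the poster's actual config):

```xml
<!-- Hard commit every 15 s without opening a new searcher -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<!-- Soft commit (visibility) every 10 min -->
<autoSoftCommit>
  <maxTime>600000</maxTime>
</autoSoftCommit>
```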
>>>>> 
>>>>> 
>>>>> - We tried the following
>>>>>   - commit strategy
>>>>>       - hard commit - 150 secs
>>>>>       - soft commit - 5 mins
>>>>>   - with the G1 garbage collector based on https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr:
>>>>>       - the nodes go to recovery state in less than a minute.
>>>>> 
>>>>> The issue is seen even when the leaders are balanced across the three
>>>> nodes.
>>>>> 
>>>>> Can you help us find the solution to this problem?
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Regards,
>>> Sravan
>> 
>> 
> 
> 
> -- 
> Regards,
> Sravan
