Re: SolrCloud Nodes going to recovery state during indexing

2018-01-04 Thread Emir Arnautović
Hi Sravan,
Glad to hear it helped!

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Re: SolrCloud Nodes going to recovery state during indexing

2018-01-04 Thread Sravan Kumar
Emir,
  'delete_by_query' was the cause of the replicas going into recovery state.
  I replaced it with delete_by_id as you suggested. Everything works fine
after that: the cluster has held for nearly 3 hours without any failures.
  Thanks Emir.



Re: SolrCloud Nodes going to recovery state during indexing

2018-01-03 Thread Emir Arnautović
Hi Sravan,
DBQ does not play well with indexing - it causes indexing to be completely
blocked on replicas while it is running, so it is highly likely the root
cause of your issues. If you can change your indexing logic to avoid it,
you can quickly test this. As a workaround, you can query for the IDs that
need to be deleted and execute a bulk delete by ID - that will not cause
the same issues as DBQ.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
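
A minimal SolrJ sketch of that workaround might look like the following,
assuming Solr 6.x, a String uniqueKey field named "id", and a
"last_updated" date field matching the DBQ range (both field names are
illustrative, not confirmed by the thread). It pages through the matching
IDs with a cursor and deletes them in batches by ID:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class DeleteOldDocsById {
    public static void main(String[] args) throws SolrServerException, IOException {
        // ZK ensemble and collection name are placeholders.
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                .build()) {
            client.setDefaultCollection("mycollection");

            // The same range the delete_by_query would have used.
            SolrQuery query = new SolrQuery("last_updated:[* TO NOW-1DAY]");
            query.setFields("id");
            query.setRows(1000);
            // Cursor-based paging requires a sort on the uniqueKey field.
            query.setSort(SolrQuery.SortClause.asc("id"));

            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = client.query(query);

                List<String> ids = new ArrayList<>();
                for (SolrDocument doc : rsp.getResults()) {
                    ids.add((String) doc.getFieldValue("id"));
                }
                if (!ids.isEmpty()) {
                    // Per-document deletes; unlike DBQ, these do not block
                    // concurrent indexing on the replicas.
                    client.deleteById(ids);
                }

                String next = rsp.getNextCursorMark();
                if (next.equals(cursor)) {
                    break; // cursor did not advance: all matches visited
                }
                cursor = next;
            }
            // Deletes become durable on the next autoCommit; no explicit
            // commit is issued here.
        }
    }
}

Run on a schedule, instead of attaching the DBQ to every bulk insert, this
should keep the replicas indexing without interruption.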






Re: SolrCloud Nodes going to recovery state during indexing

2018-01-03 Thread Sravan Kumar
Emir,
Yes, there is a delete_by_query on every bulk insert.
This delete_by_query deletes all documents whose update time is earlier
than one day before the current time.
Is the bulk delete_by_query the reason?
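
For context, a sketch of the pattern being described - a delete-by-query
issued alongside every bulk insert ("last_updated" is an assumed field
name; the real schema may differ):

import java.io.IOException;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

public class BulkUpdateWithDbq {
    // The DBQ below is what blocks indexing on the replicas while it runs.
    static void indexBatch(SolrClient client, List<SolrInputDocument> batch)
            throws SolrServerException, IOException {
        // Purge documents last updated more than a day ago.
        client.deleteByQuery("last_updated:[* TO NOW-1DAY]");
        // The ~50-document bulk update.
        client.add(batch);
    }
}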



-- 
Regards,
Sravan


Re: SolrCloud Nodes going to recovery state during indexing

2018-01-03 Thread Emir Arnautović
Do you have deletes by query while indexing, or is it an append-only index?

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Jan 2018, at 12:16, sravan  wrote:
> 
> SolrCloud Nodes going to recovery state during indexing
> 
> 
> We have a SolrCloud setup with the settings shared below: a collection
> with 3 shards and a replica for each of them.
> 
> Normal state (as soon as the whole cluster is restarted):
> - Status of all the shards is UP.
> - Bulk update requests of 50 documents each take < 100ms.
> - 6-10 simultaneous bulk updates.
> 
> Nodes go into recovery state after 15-30 mins of updates:
> - Some shards start giving the following ERRORs:
>     - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Read timed out
>     - o.a.s.u.StreamingSolrClients error java.net.SocketTimeoutException: Read timed out
> - The following error is seen on the shard which goes into recovery state:
>     - too many updates received since start - startingUpdates no longer 
> overlaps with our currentUpdates.
> - Sometimes the same shard even goes into DOWN state and needs a node 
> restart to come back.
> - A bulk update request of 50 documents takes more than 5 seconds, 
> sometimes even >120 secs. This is seen for all requests if at least one 
> node in the whole cluster is in recovery state.
> 
> We have a standalone setup with the same collection schema which is able to 
> take update & query load without any errors.
> 
> 
> We have the following SolrCloud setup:
> - setup in AWS.
> 
> - Zookeeper Setup:
>     - number of nodes: 3
>     - aws instance type: t2.small
>     - instance memory: 2gb
> 
> - Solr Setup:
>     - Solr version: 6.6.0
>     - number of nodes: 3
>     - aws instance type: m5.xlarge
>     - instance memory: 16gb
>     - number of cores: 4
>     - JAVA HEAP: 8gb
>     - JAVA VERSION: oracle java version "1.8.0_151"
>     - GC settings: default CMS.
> 
> Collection settings:
> - number of shards: 3
> - replication factor: 2
> - total 6 replicas.
> - total number of documents in the collection: 12 million
> - total number of documents in each shard: 4 million
> - Each document has around 25 fields, 12 of them with textual 
> analysers & filters.
> - Commit Strategy (see the solrconfig.xml sketch after this message):
>     - No explicit commits from application code.
>     - Hard commit every 15 secs with openSearcher set to false.
>     - Soft commit every 10 mins.
> - Cache Strategy:
>     - filter queries
>         - size: 512
>         - autowarmCount: 100
>     - all other caches
>         - size: 512
>         - autowarmCount: 0
> - maxWarmingSearchers: 2
> 
> 
> - We tried the following:
>     - commit strategy:
>         - hard commit - 150 secs
>         - soft commit - 5 mins
>     - with the G1 garbage collector, based on 
> https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr:
>         - the nodes go into recovery state in less than a minute.
> 
> The issue is seen even when the leaders are balanced across the three nodes.
> 
> Can you help us find the solution to this problem?
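
For reference, the commit and cache strategy described above corresponds
to solrconfig.xml entries roughly like the following (a sketch: sizes and
autowarm counts as listed above; cache implementation classes and the rest
are assumed defaults):

<!-- Commit strategy: hard commit every 15 secs without opening a
     searcher, soft commit every 10 mins. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>
</updateHandler>

<!-- Cache strategy: only the filter cache is autowarmed. -->
<query>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512"
               autowarmCount="100"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                    autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512"
                 autowarmCount="0"/>
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>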