Hi,

Does anyone have any idea about this issue?
Apart from the errors in the previous email, we are frequently seeing the errors below:

2020-08-19 11:56:09.467 ERROR (qtp1546693040-32) [c:collection_4 s:shard3
r:core_node13 x:collection_4_shard3_replica_n10] o.a.s.u.SolrCmdDistributor
java.io.IOException: Request processing has stalled for 20017ms with 100
remaining elements in the queue.

2020-08-19 11:56:16.243 ERROR (qtp1546693040-72) [c:collection_4 s:shard3
r:core_node13 x:collection_4_shard3_replica_n10] o.a.s.h.RequestHandlerBase
java.io.IOException: Task queue processing has stalled for 20216 ms with 0
remaining elements to process.

2020-08-19 11:56:22.584 ERROR (qtp1546693040-32) [c:collection_4 s:shard3
r:core_node13 x:collection_4_shard3_replica_n10]
o.a.s.u.p.DistributedZkUpdateProcessor Setting up to try to start recovery
on replica core_node11 with url
http://x.x.x.25:8983/solr/collection_4_shard3_replica_n8/ by increasing
leader term => java.io.IOException: Request processing has stalled for
20017ms with 100 remaining elements in the queue.

2020-08-19 11:56:16.064 ERROR (updateExecutor-5-thread-8-processing-null) [
  ] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling
SolrCmdDistributor$Req:
cmd=delete{_version_=-1675454745405292544,query=`{!cache=false}_expire_at_:[*
TO 2020-08-19T11:55:47.604Z]`,commitWithin=-1}; node=ForwardNode:
http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/ to
http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/ =>
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/:
null


On Tue, Aug 11, 2020 at 2:08 AM Anshuman Singh <singhanshuma...@gmail.com>
wrote:

> Just to give you an idea, this is how we are ingesting:
>
> {"id": 1, "field1": {"inc": 20}, "field2": {"inc": 30}, "field3": 40.
> "field4": "some string"}
>
> We are using Solr 8.5.1. We have not configured any update processor. Hard
> commits happen every minute or at 100k docs, and soft commits happen every
> 10 minutes.
> We have an external ZK setup with 5 nodes.
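>
> Those commit settings correspond to roughly the following in our
> solrconfig.xml (a paraphrased sketch, not a verbatim copy of our config):
>
> <autoCommit>
>   <maxTime>60000</maxTime>        <!-- hard commit every minute -->
>   <maxDocs>100000</maxDocs>       <!-- or every 100k docs -->
>   <openSearcher>false</openSearcher> <!-- assumed; typical with a separate soft commit -->
> </autoCommit>
> <autoSoftCommit>
>   <maxTime>600000</maxTime>       <!-- soft commit every 10 minutes -->
> </autoSoftCommit>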
>
> The open files hard/soft limit is 65k and "max user processes" is unlimited.
>
> These are the different ERROR logs I found in the log files:
>
> ERROR (qtp1546693040-2637) [c:collection s:shard27 r:core_node109
> x:collection_shard27_replica_n106] o.a.s.s.HttpSolrCall
> null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
> Async exception during distributed update: java.net.ConnectException:
> Connection refused
>
> ERROR (qtp1546693040-1136) [c:collection s:shard101 r:core_node405
> x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall
> null:java.io.IOException: java.lang.InterruptedException
>
> ERROR (qtp1546693040-2704) [c:collection s:shard101 r:core_node405
> x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall
> null:org.eclipse.jetty.io.EofException: Reset cancel_stream_error
>
> ERROR (qtp1546693040-1344) [c:collection s:shard20 r:core_node79
> x:collection_shard20_replica_n76] o.a.s.h.RequestHandlerBase
> org.apache.solr.common.SolrException: No registered leader was found after
> waiting for 4000ms , collection: collection slice: shard48 saw
> state=DocCollection(collection//collections/collection/state.json/96434)={
>
> ERROR (qtp1546693040-2928) [c:collection s:shard80 r:core_node319
> x:collection_shard80_replica_n316] o.a.s.h.RequestHandlerBase
> org.apache.solr.common.SolrException: Request says it is coming from
> leader, but we are the leader
>
> ERROR (updateExecutor-5-thread-47-processing-n:192.100.20.19:8985_solr
> x:collection_shard161_replica_n641 c:collection s:shard161 r:core_node646)
> [c:collection s:shard161 r:core_node646 x:collection_shard161_replica_n641]
> o.a.s.u.SolrCmdDistributor
> org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException:
> Error from server at null: Expected mime type application/octet-stream but
> got application/json
>
> ERROR (recoveryExecutor-7-thread-16-processing-n:192.100.20.33:8984_solr
> x:collection_shard80_replica_n47 c:collection s:shard80 r:core_node48)
> [c:collection s:shard80 r:core_node48 x:collection_shard80_replica_n47]
> o.a.s.c.RecoveryStrategy Error while trying to recover.
> core=collection_shard80_replica_n47:java.util.concurrent.ExecutionException:
> org.apache.solr.client.solrj.SolrServerException: IOException occurred when
> talking to server at: http://192.100.20.34:8984/solr
>
> ERROR (zkCallback-10-thread-22) [c:collection s:shard19 r:core_node322
> x:collection_shard19_replica_n321] o.a.s.c.ShardLeaderElectionContext There
> was a problem trying to register as the
> leader:org.apache.solr.common.AlreadyClosedException
>
> ERROR
> (OverseerStateUpdate-176461820351853980-192.100.20.34:8985_solr-n_0000002357)
> [   ] o.a.s.c.Overseer Overseer could not process the current clusterstate
> state update message, skipping the message: {
>
> ERROR (main-EventThread) [   ] o.a.z.ClientCnxn Error while calling
> watcher  => java.lang.OutOfMemoryError: unable to create new native thread
>
> ERROR 
> (coreContainerWorkExecutor-2-thread-1-processing-n:192.100.20.34:8986_solr)
> [   ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on
> startup => org.apache.solr.cloud.ZkController$NotInClusterStateException:
> coreNodeName core_node638 does not exist in shard shard105, ignore the
> exception if the replica was deleted
>
> ERROR (qtp836220863-249) [c:collection s:shard162 r:core_node548
> x:collection_shard162_replica_n547] o.a.s.h.RequestHandlerBase
> org.apache.solr.common.SolrException: No registered leader was found after
> waiting for 4000ms , collection: collection slice: shard162 saw
> state=DocCollection(collection//collections/collection/state.json/43121)={
>
> Regards,
> Anshuman
>
> On Mon, Aug 10, 2020 at 9:19 PM Jörn Franke <jornfra...@gmail.com> wrote:
>
>> How exactly do you ingest it with atomic updates? Is there an update
>> processor in between?
>>
>> What are your settings for hard/soft commit?
>>
>> For the shards going into recovery - do you have a log entry or something?
>>
>> What is the Solr version?
>>
>> How did you set up ZK?
>>
>> > Am 10.08.2020 um 16:24 schrieb Anshuman Singh <
>> singhanshuma...@gmail.com>:
>> >
>> > Hi,
>> >
>> > We have a SolrCloud cluster with 10 nodes, with 6B records ingested into
>> > the collection. Our use case requires atomic updates ("inc") on 5 fields.
>> > Now almost 90% of the documents are atomic updates, and as soon as we
>> > start our ingestion pipelines, multiple shards start going into recovery;
>> > sometimes all replicas of some shards go into the down state.
>> > The ingestion rate with atomic updates is also very slow, 4-5k records
>> > per second, whereas we could ingest records without atomic updates at
>> > 50k records per second without any issues.
>> >
>> > My suspicion is that these "inc" atomic updates require fetching the
>> > existing document before indexing, which would explain the slow rate,
>> > but what I don't understand is why the replicas are going into recovery.
>> >
>> > Regards,
>> > Anshuman
>>
>
