Hi,

Does anyone have any idea about this issue? Apart from the errors in the previous email, we are frequently seeing the errors below:
2020-08-19 11:56:09.467 ERROR (qtp1546693040-32) [c:collection_4 s:shard3 r:core_node13 x:collection_4_shard3_replica_n10] o.a.s.u.SolrCmdDistributor java.io.IOException: Request processing has stalled for 20017ms with 100 remaining elements in the queue.

2020-08-19 11:56:16.243 ERROR (qtp1546693040-72) [c:collection_4 s:shard3 r:core_node13 x:collection_4_shard3_replica_n10] o.a.s.h.RequestHandlerBase java.io.IOException: Task queue processing has stalled for 20216 ms with 0 remaining elements to process.

2020-08-19 11:56:22.584 ERROR (qtp1546693040-32) [c:collection_4 s:shard3 r:core_node13 x:collection_4_shard3_replica_n10] o.a.s.u.p.DistributedZkUpdateProcessor Setting up to try to start recovery on replica core_node11 with url http://x.x.x.25:8983/solr/collection_4_shard3_replica_n8/ by increasing leader term => java.io.IOException: Request processing has stalled for 20017ms with 100 remaining elements in the queue.

2020-08-19 11:56:16.064 ERROR (updateExecutor-5-thread-8-processing-null) [ ] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=delete{_version_=-1675454745405292544,query=`{!cache=false}_expire_at_:[* TO 2020-08-19T11:55:47.604Z]`,commitWithin=-1}; node=ForwardNode: http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/ to http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/ => org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/: null

On Tue, Aug 11, 2020 at 2:08 AM Anshuman Singh <singhanshuma...@gmail.com> wrote:

> Just to give you an idea, this is how we are ingesting (a SolrJ sketch of
> this kind of update appears at the end of this thread):
>
> {"id": 1, "field1": {"inc": 20}, "field2": {"inc": 30}, "field3": 40,
> "field4": "some string"}
>
> We are using Solr 8.5.1. We have not configured any update processor.
> Hard commit happens every minute or at 100k docs; soft commit happens
> every 10 mins (see the solrconfig sketch at the end of this thread).
> We have an external ZK setup with 5 nodes.
>
> Open files hard/soft limit is 65k and "max user processes" is unlimited.
> These are the different ERROR logs I found in the log files:
>
> ERROR (qtp1546693040-2637) [c:collection s:shard27 r:core_node109 x:collection_shard27_replica_n106] o.a.s.s.HttpSolrCall null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: java.net.ConnectException: Connection refused
>
> ERROR (qtp1546693040-1136) [c:collection s:shard101 r:core_node405 x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall null:java.io.IOException: java.lang.InterruptedException
>
> ERROR (qtp1546693040-2704) [c:collection s:shard101 r:core_node405 x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall null:org.eclipse.jetty.io.EofException: Reset cancel_stream_error
>
> ERROR (qtp1546693040-1344) [c:collection s:shard20 r:core_node79 x:collection_shard20_replica_n76] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: collection slice: shard48 saw state=DocCollection(collection//collections/collection/state.json/96434)={
>
> ERROR (qtp1546693040-2928) [c:collection s:shard80 r:core_node319 x:collection_shard80_replica_n316] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Request says it is coming from leader, but we are the leader
>
> ERROR (updateExecutor-5-thread-47-processing-n:192.100.20.19:8985_solr x:collection_shard161_replica_n641 c:collection s:shard161 r:core_node646) [c:collection s:shard161 r:core_node646 x:collection_shard161_replica_n641] o.a.s.u.SolrCmdDistributor org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at null: Expected mime type application/octet-stream but got application/json
>
> ERROR (recoveryExecutor-7-thread-16-processing-n:192.100.20.33:8984_solr x:collection_shard80_replica_n47 c:collection s:shard80 r:core_node48) [c:collection s:shard80 r:core_node48 x:collection_shard80_replica_n47] o.a.s.c.RecoveryStrategy Error while trying to recover. core=collection_shard80_replica_n47:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: http://192.100.20.34:8984/solr
>
> ERROR (zkCallback-10-thread-22) [c:collection s:shard19 r:core_node322 x:collection_shard19_replica_n321] o.a.s.c.ShardLeaderElectionContext There was a problem trying to register as the leader:org.apache.solr.common.AlreadyClosedException
>
> ERROR (OverseerStateUpdate-176461820351853980-192.100.20.34:8985_solr-n_0000002357) [ ] o.a.s.c.Overseer Overseer could not process the current clusterstate state update message, skipping the message: {
>
> ERROR (main-EventThread) [ ] o.a.z.ClientCnxn Error while calling watcher => java.lang.OutOfMemoryError: unable to create new native thread
>
> ERROR (coreContainerWorkExecutor-2-thread-1-processing-n:192.100.20.34:8986_solr) [ ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on startup => org.apache.solr.cloud.ZkController$NotInClusterStateException: coreNodeName core_node638 does not exist in shard shard105, ignore the exception if the replica was deleted
>
> ERROR (qtp836220863-249) [c:collection s:shard162 r:core_node548 x:collection_shard162_replica_n547] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: collection slice: shard162 saw state=DocCollection(collection//collections/collection/state.json/43121)={
>
> Regards,
> Anshuman
>
> On Mon, Aug 10, 2020 at 9:19 PM Jörn Franke <jornfra...@gmail.com> wrote:
>
>> How exactly do you ingest with atomic updates? Is there an update
>> processor in between?
>>
>> What are your settings for hard/soft commit?
>>
>> For the shards going into recovery - do you have a log entry or something?
>>
>> What is the Solr version?
>>
>> How do you set up ZK?
>>
>> > On 10.08.2020 at 16:24, Anshuman Singh <singhanshuma...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > We have a SolrCloud cluster with 10 nodes. We have 6B records ingested
>> > in the Collection. Our use case requires atomic updates ("inc") on 5 fields.
>> > Now almost 90% of the documents are atomic updates, and as soon as we
>> > start our ingestion pipelines, multiple shards start going into recovery;
>> > sometimes all replicas of some shards go into the down state.
>> > The ingestion rate is also very slow with atomic updates, 4-5k records
>> > per second. We were able to ingest records without atomic updates at
>> > 50k records per second without any issues.
>> >
>> > What I suspect is that these "inc" atomic updates require fetching the
>> > stored fields before indexing, which would explain the slow rate, but
>> > what I don't understand is why the replicas are going into recovery.
>> >
>> > Regards,
>> > Anshuman
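For anyone trying to reproduce this, here is a minimal SolrJ sketch of the kind of "inc" atomic update described in the thread. The ZooKeeper address is a placeholder and the collection/field names just mirror the example payload above; this is an illustration under those assumptions, not the original ingestion pipeline.

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicIncExample {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble -- substitute your own 5-node setup.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            // A map value keyed by "inc" makes this an atomic increment:
            // Solr fetches the stored document, adds the delta, and
            // re-indexes the whole document (the read-before-write the
            // thread suspects is behind the slow ingestion rate).
            doc.addField("field1", Collections.singletonMap("inc", 20));
            doc.addField("field2", Collections.singletonMap("inc", 30));
            // Plain values overwrite the field as usual.
            doc.addField("field3", 40);
            doc.addField("field4", "some string");

            client.add("collection_4", doc);
            // No explicit commit: visibility is left to the autoCommit /
            // autoSoftCommit policy (sketched below).
        }
    }
}

Note that atomic updates only work when the uniqueKey and the updated fields are stored (or have docValues), because Solr has to rebuild the full document from its stored values before re-indexing it.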
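And a solrconfig.xml sketch of the commit policy stated in the thread. The numbers mirror what was described (hard commit every minute or at 100k docs, soft commit every 10 minutes); openSearcher=false is my assumption, since that is the usual pairing when a separate soft commit controls visibility.

<!-- Commit policy as described in the thread; openSearcher=false is an
     assumption, the intervals and doc count come from the email. -->
<autoCommit>
  <maxTime>60000</maxTime>       <!-- hard commit every 60s ... -->
  <maxDocs>100000</maxDocs>      <!-- ... or every 100k docs -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>600000</maxTime>      <!-- soft commit (visibility) every 10 mins -->
</autoSoftCommit>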