Are you able to use TLOG replicas? That should significantly reduce the
time it takes to recover. It doesn't seem like you have a hard need for
near-real-time search, since slow ingestions are fine.
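
For reference, a minimal sketch of how a TLOG replica could be requested
through the Collections API's ADDREPLICA action with type=tlog. The host,
collection name, and shard name below are placeholders, and the helper
function is hypothetical. It only builds the request URL; you would still
need to send it and then retire the old NRT replicas yourself.

```python
from urllib.parse import urlencode


def add_tlog_replica_url(base_url, collection, shard):
    """Build a Collections API ADDREPLICA request URL for one TLOG replica.

    base_url, collection, and shard are placeholders supplied by the caller.
    """
    params = {
        "action": "ADDREPLICA",
        "collection": collection,
        "shard": shard,
        "type": "tlog",  # replica type: nrt, tlog, or pull
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Assumed host and names, for illustration only.
print(add_tlog_replica_url("http://localhost:8983/solr", "collection", "shard41"))
```

The same call would have to be repeated per shard, so test it on a single
shard before rolling it out to all 50.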

- Houston

On Tue, Aug 25, 2020 at 12:03 PM Anshuman Singh <singhanshuma...@gmail.com>
wrote:

> Hi,
>
> We have a 10-node (150G RAM, 1TB SAS HDD, 32 cores) Solr 8.5.1 cluster with
> 50 shards, rf 2 (NRT replicas), and 7B docs. We have 5 ZK nodes, 2 of them
> running on the same nodes as Solr. Our use case requires continuous
> ingestion (mostly updates). If we ingest at 40k records per sec, after
> 10-15 mins some replicas go into recovery with the errors given at the
> end. We also observed high CPU usage during these ingestions (60-70%), and
> disks frequently reach 100% utilization.
>
> We know our hardware is limited, but this system will be used by only a few
> users, and search times of a few minutes and slow ingestions are fine, so
> we are trying to run with these specifications for now. However, recovery
> is becoming a bottleneck.
>
> To prevent recovery, which I think could be due to high CPU/disk usage
> during ingestions, we reduced the data rate to 10k records per sec. Now CPU
> usage is not high and recovery is less frequent, but it can still happen
> over a long run of 2-3 hrs. We further reduced the rate to 4k records per
> sec, but it happened again after 3-4 hrs. Logs were filled with the error
> below on the instance on which recovery happened. It seems that reducing
> the data rate is not helping with recovery.
>
> *2020-08-25 12:16:11.008 ERROR (qtp1546693040-235) [c:collection s:shard41
> r:core_node565 x:collection_shard41_replica_n562] o.a.s.s.HttpSolrCall
> null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
> timeout expired: 300000/300000 ms*
>
> A Solr thread dump showed commit threads taking up to 10-15 minutes.
> Currently, auto commit happens at 10M docs or 30 seconds.
>
> Can someone point me in the right direction? Also, can we perform
> core-binding for Solr processes?
>
> *2020-08-24 12:32:55.835 WARN  (zkConnectionManagerCallback-11-thread-1) [
>   ] o.a.s.c.c.ConnectionManager Watcher
> org.apache.solr.common.cloud.ConnectionManager@372ea2bc name:
> ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event
> WatchedEvent state:Disconnected type:None path:null path: null type: None*
>
> *2020-08-24 12:41:02.005 WARN  (main-SendThread(x.x.x.8:2181)) [   ] o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session 0x273f9a8fb229269 has expired
> 2020-08-24 12:41:06.177 WARN  (MetricsHistoryHandler-8-thread-1) [   ] o.a.s.h.a.MetricsHistoryHandler Could not obtain overseer's address, skipping. => org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:134) ~[?:?]
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[?:?]
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131) ~[?:?]
> 2020-08-24 12:41:13.365 WARN  (zkConnectionManagerCallback-11-thread-1) [   ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@372ea2bc name: ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event WatchedEvent state:Expired type:None path:null path: null type: None
> 2020-08-24 12:41:13.366 WARN  (zkConnectionManagerCallback-11-thread-1) [   ] o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
> 2020-08-24 12:41:16.705 ERROR (qtp1546693040-163255) [c:collection s:shard31 r:core_node525 x:collection_shard31_replica_n522] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled*
>
