Hi,

We have a 10-node (150 GB RAM, 1 TB SAS HDD, 32 cores) Solr 8.5.1 cluster with 50 shards, replication factor 2 (NRT replicas), and 7B docs. We have 5 ZooKeeper nodes, 2 of which run on the same nodes as Solr. Our use case requires continuous ingestion (mostly updates). If we ingest at 40k records per second, after 10-15 minutes some replicas go into recovery, with the errors shown at the end of this message. We also observed high CPU usage (60-70%) during these ingestions, and the disks frequently reach 100% utilization.
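To put these figures in perspective, here is a rough back-of-envelope sketch in Python. It assumes that shards, replicas, and documents are spread evenly across the nodes, which we have not verified; the variable names are only for illustration.

```python
# Back-of-envelope layout figures derived from the numbers quoted above.
# Assumption (not verified): shards, replicas and docs are evenly distributed.
NODES = 10
SHARDS = 50
REPLICATION_FACTOR = 2
TOTAL_DOCS = 7_000_000_000
INGEST_RATE = 40_000  # records per second, cluster-wide

# Number of Solr cores (shard replicas) hosted on each node.
cores_per_node = SHARDS * REPLICATION_FACTOR // NODES

# Documents held by one shard, and by all cores on one node.
docs_per_shard = TOTAL_DOCS // SHARDS
docs_hosted_per_node = docs_per_shard * cores_per_node

# Updates routed to each shard per second.
updates_per_shard_per_sec = INGEST_RATE // SHARDS

# NRT replicas each index every update locally, so the per-node
# indexing load counts every replica, not just the leaders.
index_ops_per_node_per_sec = INGEST_RATE * REPLICATION_FACTOR // NODES

print(f"cores per node:            {cores_per_node}")
print(f"docs per shard:            {docs_per_shard:,}")
print(f"docs hosted per node:      {docs_hosted_per_node:,}")
print(f"updates per shard per sec: {updates_per_shard_per_sec}")
print(f"index ops per node per sec: {index_ops_per_node_per_sec}")
```

Under these assumptions each node hosts about 10 cores holding roughly 1.4 billion documents in total, all sharing a single 1 TB SAS disk.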
We know our hardware is limited, but this system will be used by only a few users, and search times of a few minutes and slow ingestion are acceptable, so we are trying to run with these specifications for now. However, recovery is becoming a bottleneck. To prevent recovery, which I suspect is caused by high CPU/disk load during ingestion, we reduced the data rate to 10k records per second. CPU usage is no longer high and recovery is less frequent, but it can still happen over a long run of 2-3 hours. We further reduced the rate to 4k records per second, but recovery happened again after 3-4 hours. The logs on the instance where recovery happened were filled with the error below. It seems that reducing the data rate is not helping with recovery.

2020-08-25 12:16:11.008 ERROR (qtp1546693040-235) [c:collection s:shard41 r:core_node565 x:collection_shard41_replica_n562] o.a.s.s.HttpSolrCall null:java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms

A Solr thread dump showed commit threads taking up to 10-15 minutes. Currently, auto commit happens at 10M docs or 30 seconds. Can someone point me in the right direction? Also, can we perform core binding for Solr processes?

2020-08-24 12:32:55.835 WARN (zkConnectionManagerCallback-11-thread-1) [ ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@372ea2bc name: ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event WatchedEvent state:Disconnected type:None path:null path: null type: None
2020-08-24 12:41:02.005 WARN (main-SendThread(x.x.x.8:2181)) [ ] o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session 0x273f9a8fb229269 has expired
2020-08-24 12:41:06.177 WARN (MetricsHistoryHandler-8-thread-1) [ ] o.a.s.h.a.MetricsHistoryHandler Could not obtain overseer's address, skipping. => org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:134) ~[?:?]
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[?:?]
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131) ~[?:?]
2020-08-24 12:41:13.365 WARN (zkConnectionManagerCallback-11-thread-1) [ ] o.a.s.c.c.ConnectionManager Watcher org.apache.solr.common.cloud.ConnectionManager@372ea2bc name: ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event WatchedEvent state:Expired type:None path:null path: null type: None
2020-08-24 12:41:13.366 WARN (zkConnectionManagerCallback-11-thread-1) [ ] o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
2020-08-24 12:41:16.705 ERROR (qtp1546693040-163255) [c:collection s:shard31 r:core_node525 x:collection_shard31_replica_n522] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled
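
For context on the auto commit settings mentioned above (10M docs or 30 seconds), here is a rough Python sketch of which limit would be reached first at each ingest rate we tried. It assumes the limits apply per core and that updates are spread evenly over the 50 shards; `first_trigger` is a hypothetical helper written for this message, not a Solr API.

```python
# Which auto commit limit is hit first at a given cluster-wide ingest rate?
# Assumptions (not verified): the limits apply per core, and updates are
# spread evenly across all shards.
MAX_DOCS = 10_000_000   # our document-count limit
MAX_TIME_SEC = 30       # our time limit
SHARDS = 50

def first_trigger(rate_per_sec, shards=SHARDS):
    """Return (limit hit first, docs per 30 s window, seconds to reach MAX_DOCS)."""
    per_core_rate = rate_per_sec / shards
    secs_to_max_docs = MAX_DOCS / per_core_rate
    docs_per_window = per_core_rate * MAX_TIME_SEC
    trigger = "maxTime" if MAX_TIME_SEC <= secs_to_max_docs else "maxDocs"
    return trigger, docs_per_window, secs_to_max_docs

for rate in (40_000, 10_000, 4_000):
    trigger, docs, secs = first_trigger(rate)
    print(f"{rate:>6} rec/s: {trigger} fires first, "
          f"~{docs:,.0f} docs per commit per core, "
          f"doc limit would take ~{secs:,.0f} s")
```

Under these assumptions the time limit is reached long before the document limit at all three rates, so each core would commit about every 30 seconds regardless of how far we throttle ingestion.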