Are you able to use TLOG replicas? That should reduce the time it takes to recover significantly. It doesn't seem like you have a hard need for near-real-time, since slow ingestions are fine.
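
As a rough sketch of what that could look like: the Collections API CREATE action accepts nrtReplicas and tlogReplicas parameters, so a TLOG-only collection can be requested with a URL like the one built below. The host, port, and collection name are placeholders rather than values from your cluster, and the snippet only builds the URL; it does not send anything to Solr.

```python
from urllib.parse import urlencode


def tlog_create_url(base_url, name, num_shards, tlog_replicas):
    """Build a Collections API CREATE URL that asks for TLOG replicas only.

    base_url and name are hypothetical placeholders; adjust them for the
    actual cluster before issuing the request.
    """
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "nrtReplicas": 0,               # no NRT replicas
        "tlogReplicas": tlog_replicas,  # TLOG replicas per shard
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# 50 shards with 2 replicas each, mirroring the layout described in the thread.
url = tlog_create_url("http://localhost:8983/solr", "collection_tlog", 50, 2)
print(url)
```

For an existing collection, I believe the usual route is ADDREPLICA with type=TLOG on each shard followed by removing the NRT replicas, but please check that against the reference guide for 8.5 before relying on it.
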
- Houston

On Tue, Aug 25, 2020 at 12:03 PM Anshuman Singh <singhanshuma...@gmail.com> wrote:

> Hi,
>
> We have a 10 node (150G RAM, 1TB SAS HDD, 32 cores) Solr 8.5.1 cluster with
> 50 shards, rf 2 (NRT replicas), 7B docs, We have 5 Zk with 2 running on the
> same nodes where Solr is running. Our use case requires continuous
> ingestions (updates mostly). If we ingest at 40k records per sec, after
> 10-15mins some replicas go into recovery with the errors observed given in
> the end. We also observed high CPU during these ingestions (60-70%) and
> disks frequently reach 100% utilization.
>
> We know our hardware is limited but this system will be used by only a few
> users and search times taking a few minutes and slow ingestions are fine so
> we are trying to run with these specifications for now but recovery is
> becoming a bottleneck.
>
> So to prevent recovery which I'm thinking could be due to high CPU/Disk
> during ingestions, we reduced the data rate to 10k records per sec. Now CPU
> usage is not high and recovery is not that frequent but it can happen in a
> long run of 2-3 hrs. We further reduced the rate to 4k records per sec but
> again it happened after 3-4 hrs. Logs were filled with the below error on
> the instance on which recovery happened. Seems like reducing data rate is
> not helping with recovery.
>
> *2020-08-25 12:16:11.008 ERROR (qtp1546693040-235) [c:collection s:shard41
> r:core_node565 x:collection_shard41_replica_n562] o.a.s.s.HttpSolrCall
> null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
> timeout expired: 300000/300000 ms*
>
> Solr thread dump showed commit threads taking upto 10-15 minutes. Currently
> auto commit happens at 10M docs or 30seconds.
>
> Can someone point me in the right direction? Also can we perform
> core-binding for Solr processes?
>
> *2020-08-24 12:32:55.835 WARN (zkConnectionManagerCallback-11-thread-1) [ ]
> o.a.s.c.c.ConnectionManager Watcher
> org.apache.solr.common.cloud.ConnectionManager@372ea2bc name:
> ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event
> WatchedEvent state:Disconnected type:None path:null path: null type: None*
>
> *2020-08-24 12:41:02.005 WARN (main-SendThread(x.x.x.8:2181)) [ ]
> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> 0x273f9a8fb229269 has expired
> 2020-08-24 12:41:06.177 WARN (MetricsHistoryHandler-8-thread-1) [ ]
> o.a.s.h.a.MetricsHistoryHandler Could not obtain overseer's address,
> skipping. => org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/leader
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/leader
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:134) ~[?:?]
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[?:?]
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131) ~[?:?]
> 2020-08-24 12:41:13.365 WARN (zkConnectionManagerCallback-11-thread-1) [ ]
> o.a.s.c.c.ConnectionManager Watcher
> org.apache.solr.common.cloud.ConnectionManager@372ea2bc name:
> ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event
> WatchedEvent state:Expired type:None path:null path: null type: None
> 2020-08-24 12:41:13.366 WARN (zkConnectionManagerCallback-11-thread-1) [ ]
> o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was expired.
> Attempting to reconnect to recover relationship with ZooKeeper...
> 2020-08-24 12:41:16.705 ERROR (qtp1546693040-163255)
> [c:collection s:shard31 r:core_node525 x:collection_shard31_replica_n522]
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Cannot
> talk to ZooKeeper - Updates are disabled*
>