Hi,

We have a 10-node (150G RAM, 1TB SAS HDD, 32 cores) Solr 8.5.1 cluster with
50 shards, rf 2 (NRT replicas), and 7B docs. We have 5 ZooKeeper (ZK) nodes,
2 of which run on the same nodes as Solr. Our use case requires continuous
ingestion (mostly updates). If we ingest at 40k records per sec, after
10-15 minutes some replicas go into recovery with the errors shown at the
end of this message. We also observed high CPU during these ingestions
(60-70%), and disks frequently reach 100% utilization.

We know our hardware is limited, but this system will be used by only a few
users, so search times of a few minutes and slow ingestion are acceptable.
We are trying to run with these specifications for now, but recovery is
becoming a bottleneck.

To prevent recovery, which I suspect is caused by high CPU/disk usage
during ingestion, we reduced the data rate to 10k records per sec. CPU
usage is no longer high and recovery is less frequent, but it can still
happen during a long run of 2-3 hrs. We further reduced the rate to 4k
records per sec, but it happened again after 3-4 hrs. The logs on the
instance where recovery happened were filled with the error below. It seems
that reducing the data rate does not prevent recovery.

*2020-08-25 12:16:11.008 ERROR (qtp1546693040-235) [c:collection s:shard41
r:core_node565 x:collection_shard41_replica_n562] o.a.s.s.HttpSolrCall
null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
timeout expired: 300000/300000 ms*

A Solr thread dump showed commit threads taking up to 10-15 minutes.
Currently, auto commit happens at 10M docs or 30 seconds.
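
For reference, here is a rough back-of-the-envelope sketch in Python of how
many documents fall into each auto commit window at the three ingestion
rates we tried. It assumes documents are spread evenly across the 50 shards,
which we have not verified; the function and constant names are only
illustrative.

```python
# Rough estimate of documents per auto commit window, per shard.
# Assumption (not verified): documents are routed evenly across all shards.

SHARDS = 50
MAX_TIME_S = 30           # auto commit time trigger: 30 seconds
MAX_DOCS = 10_000_000     # auto commit doc-count trigger: 10M docs


def docs_per_commit(cluster_rate_per_s, shards=SHARDS,
                    max_time_s=MAX_TIME_S, max_docs=MAX_DOCS):
    """Return (docs per shard per commit window, trigger that fires first)."""
    per_shard_rate = cluster_rate_per_s / shards
    docs_in_window = per_shard_rate * max_time_s
    if docs_in_window >= max_docs:
        # The doc-count trigger would fire before the timer does.
        return max_docs, "maxDocs"
    return int(docs_in_window), "maxTime"


if __name__ == "__main__":
    for rate in (40_000, 10_000, 4_000):
        docs, trigger = docs_per_commit(rate)
        print(f"{rate} docs/s -> ~{docs} docs per shard per commit ({trigger})")
```

Under that assumption, the 10M doc trigger is never reached at these rates,
so commits are driven by the 30 second timer.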

Can someone point me in the right direction? Also, can we bind the Solr
processes to specific CPU cores (core-binding)?

*2020-08-24 12:32:55.835 WARN  (zkConnectionManagerCallback-11-thread-1) [
  ] o.a.s.c.c.ConnectionManager Watcher
org.apache.solr.common.cloud.ConnectionManager@372ea2bc name:
ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event
WatchedEvent state:Disconnected type:None path:null path: null type: None*

*2020-08-24 12:41:02.005 WARN  (main-SendThread(x.x.x.8:2181)) [   ]
o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
0x273f9a8fb229269 has expired*

*2020-08-24 12:41:06.177 WARN  (MetricsHistoryHandler-8-thread-1) [   ]
o.a.s.h.a.MetricsHistoryHandler Could not obtain overseer's address,
skipping. => org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:134) ~[?:?]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[?:?]
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131) ~[?:?]*

*2020-08-24 12:41:13.365 WARN  (zkConnectionManagerCallback-11-thread-1) [
  ] o.a.s.c.c.ConnectionManager Watcher
org.apache.solr.common.cloud.ConnectionManager@372ea2bc name:
ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event
WatchedEvent state:Expired type:None path:null path: null type: None*

*2020-08-24 12:41:13.366 WARN  (zkConnectionManagerCallback-11-thread-1) [
  ] o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was
expired. Attempting to reconnect to recover relationship with
ZooKeeper...*

*2020-08-24 12:41:16.705 ERROR (qtp1546693040-163255)
[c:collection s:shard31 r:core_node525 x:collection_shard31_replica_n522]
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Cannot
talk to ZooKeeper - Updates are disabled*
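
To make the timeline in the ZooKeeper log lines above easier to read, here
is a small Python sketch that computes the gaps between three of the logged
timestamps (the Disconnected event, the session-expired warning, and the
"Updates are disabled" error). The helper name and the choice of these three
lines are mine; the timestamps are copied from the logs as-is.

```python
from datetime import datetime

# Timestamp layout used in the Solr log lines quoted above.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def gap_seconds(start, end):
    """Seconds elapsed between two Solr log timestamps (hypothetical helper)."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


disconnected = "2020-08-24 12:32:55.835"      # WatchedEvent state:Disconnected
session_expired = "2020-08-24 12:41:02.005"   # "session ... has expired"
updates_disabled = "2020-08-24 12:41:16.705"  # "Updates are disabled"

if __name__ == "__main__":
    print("Disconnected -> session expired (s):",
          gap_seconds(disconnected, session_expired))
    print("Session expired -> updates disabled (s):",
          gap_seconds(session_expired, updates_disabled))
```

The first gap is roughly eight minutes, which is how long the node appears
to have been disconnected from ZooKeeper before the client logged the
expired session.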
