RAHAT BHALLA created SOLR-10987:
-----------------------------------

             Summary: Solr Cloud (5 nodes and 70 million documents) going down, when the overseer node becomes unreachable. Started Recently
                 Key: SOLR-10987
                 URL: https://issues.apache.org/jira/browse/SOLR-10987
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: 6.1
         Environment: *The following is the usage on each of the Solr Nodes:*
Tasks: 254 total,   1 running, 252 sleeping,   0 stopped,   1 zombie
%Cpu(s):  0.4 us,  0.3 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 20392276 total,  4169296 free,  2917012 used, 13305968 buff/cache
KiB Swap:  5111804 total,  5111636 free,      168 used. 16058184 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21250 solr      20   0 23.599g 1.184g 228440 S   2.0  6.1  59:55.91 java

*Solr is running on 5 machines with similar configuration:*

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Stepping:              4
CPU MHz:               2799.033
BogoMIPS:              5600.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-3

            Reporter: RAHAT BHALLA


We host a Solr Cloud of 5 Solr nodes, with 3 ZooKeeper nodes to maintain the cloud. We have over 70 million docs spread across 13 collections, with about 40K more documents added every day in near real time, at intervals of 5 to 6 minutes.

The system was working as expected and as required for the last 7 months, until we suddenly saw the following exception and all of our instances went offline. We restarted the instances and the cloud ran smoothly for three days before it came crashing down again.

*The exception it gives before it goes down is as follows:*

3542285 ERROR (OverseerCollectionConfigSetProcessor-98221003671470081-prod-solr-node01:9080_solr-n_0000000106) [ ] o.a.s.c.OverseerTaskProcessor
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /overseer_elect/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:348)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:345)
        at org.apache.solr.cloud.OverseerTaskProcessor.amILeader(OverseerTaskProcessor.java:384)
        at org.apache.solr.cloud.OverseerTaskProcessor.run(OverseerTaskProcessor.java:191)
        at java.lang.Thread.run(Unknown Source)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
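
The stack trace in this report shows the Overseer thread failing a ZooKeeper read (`getData` on /overseer_elect/leader) with ConnectionLoss, so one possible first diagnostic is to check whether each ZooKeeper server is reachable from the Solr hosts at the time of failure. The following is only a sketch of such a check, not something taken from the report: the host names in `ZK_NODES` and the helper names `zk_ruok` and `check_ensemble` are hypothetical, and it assumes the ensemble accepts ZooKeeper's `ruok` four-letter command (enabled by default on 3.4.x; on 3.5 and later it has to be allowed through `4lw.commands.whitelist`).

```python
import socket

# Hypothetical ensemble addresses; the report does not name the ZooKeeper hosts.
ZK_NODES = [
    ("zk1.example.com", 2181),
    ("zk2.example.com", 2181),
    ("zk3.example.com", 2181),
]


def zk_ruok(host, port, timeout=3.0):
    """Send ZooKeeper's 'ruok' command and return True only if the reply is 'imok'.

    Any connection error, refusal, or timeout is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # The expected reply is exactly four bytes; stop early on EOF.
            while len(reply) < 4:
                data = sock.recv(4 - len(reply))
                if not data:
                    break
                reply += data
            return reply == b"imok"
    except OSError:
        return False


def check_ensemble(nodes, timeout=3.0):
    """Probe every (host, port) pair and map 'host:port' to its health flag."""
    return {f"{host}:{port}": zk_ruok(host, port, timeout) for host, port in nodes}
```

Calling `check_ensemble(ZK_NODES)` from each Solr host returns a dictionary such as one keyed by `zk1.example.com:2181`, with `False` for any server that did not answer. A healthy answer only shows that the server process is running and reachable; it does not rule out session timeouts caused by long pauses on the Solr side, which would need to be checked separately.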