[ https://issues.apache.org/jira/browse/SOLR-10987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
RAHAT BHALLA updated SOLR-10987:
--------------------------------
    Summary: Solr Cloud (5 nodes and 70 million documents) going down, when the overseer node becomes unreachable. Issue Started Recently  (was: Solr Cloud (5 nodes and 70 million documents) going down, when the overseer node becomes unreachable. Started Recently)

> Solr Cloud (5 nodes and 70 million documents) going down, when the overseer node becomes unreachable. Issue Started Recently
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-10987
>                 URL: https://issues.apache.org/jira/browse/SOLR-10987
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 6.1
>         Environment: *The following is the usage on each of the Solr Nodes:*
> Tasks: 254 total,   1 running, 252 sleeping,   0 stopped,   1 zombie
> %Cpu(s):  0.4 us,  0.3 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem : 20392276 total,  4169296 free,  2917012 used, 13305968 buff/cache
> KiB Swap:  5111804 total,  5111636 free,      168 used. 16058184 avail Mem
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 21250 solr      20   0 23.599g 1.184g 228440 S   2.0  6.1  59:55.91 java
> *Solr is running on 5 machines with similar configuration:*
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                4
> On-line CPU(s) list:   0-3
> Thread(s) per core:    1
> Core(s) per socket:    2
> Socket(s):             2
> NUMA node(s):          1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 62
> Model name:            Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
> Stepping:              4
> CPU MHz:               2799.033
> BogoMIPS:              5600.00
> Hypervisor vendor:     VMware
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              25600K
> NUMA node0 CPU(s):     0-3
>            Reporter: RAHAT BHALLA
>              Labels: assistance, critical, customer, impacting, issue, need, production
>
> We host a Solr Cloud of 5 nodes for Solr instances and 3 ZooKeeper nodes to
> maintain the cloud. We have over 70 million docs spread across 13 collections,
> with 40K more documents being added every day in near real time, at intervals
> of 5 to 6 minutes.
> The system was working as expected and as required for the last 7 months,
> until suddenly we saw the following exception and all of our instances went
> offline. We restarted the instances and the cloud ran smoothly for three days
> before it came crashing down again.
> *The exception it gives before it goes down is as follows:*
> 3542285 ERROR (OverseerCollectionConfigSetProcessor-98221003671470081-prod-solr-node01:9080_solr-n_0000000106) [   ] o.a.s.c.OverseerTaskProcessor
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /overseer_elect/leader
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>         at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:348)
>         at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
>         at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>         at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:345)
>         at org.apache.solr.cloud.OverseerTaskProcessor.amILeader(OverseerTaskProcessor.java:384)
>         at org.apache.solr.cloud.OverseerTaskProcessor.run(OverseerTaskProcessor.java:191)
>         at java.lang.Thread.run(Unknown Source)
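The trace above shows the Overseer thread losing its ZooKeeper connection while reading /overseer_elect/leader. One way to check from a Solr host whether each of the three ZooKeeper nodes is answering is ZooKeeper's "ruok" four-letter command, which a healthy server answers with "imok". The following is a minimal Python sketch, not part of the original report. It assumes the default client port 2181 and that four-letter commands are enabled on the servers (newer ZooKeeper releases require them to be whitelisted). The function names are made up for illustration.

```python
import socket


def zk_is_ok(host, port=2181, timeout=3.0):
    """Send 'ruok' to one ZooKeeper server; return True only if it replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # Read until we have the 4-byte answer or the server closes the socket.
            while len(reply) < 4:
                chunk = sock.recv(4 - len(reply))
                if not chunk:
                    break
                reply += chunk
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
    return reply == b"imok"


def check_ensemble(hosts, port=2181):
    """Probe every host and return a {host: bool} map of the results."""
    return {host: zk_is_ok(host, port) for host in hosts}
```

A call such as check_ensemble(["zk-node01", "zk-node02", "zk-node03"]) (hypothetical host names) could be run from each Solr node when the error appears. A False entry only says that the probe failed from that host, so it narrows the problem down to network or ZooKeeper availability rather than proving a cause.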