[jira] [Created] (SOLR-10987) Solr Cloud (5 nodes and 70 million documents) going down, when the overseer node becomes unreachable. Started Recently

RAHAT BHALLA (JIRA) Fri, 30 Jun 2017 08:50:45 -0700

RAHAT BHALLA created SOLR-10987:
-----------------------------------

             Summary: Solr Cloud (5 nodes and 70 million documents) going down, 
when the overseer node becomes unreachable. Started Recently
                 Key: SOLR-10987
                 URL: https://issues.apache.org/jira/browse/SOLR-10987
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
    Affects Versions: 6.1
         Environment: *The following is the usage on each of the Solr Nodes:*


Tasks: 254 total,   1 running, 252 sleeping,   0 stopped,   1 zombie
%Cpu(s):  0.4 us,  0.3 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 20392276 total,  4169296 free,  2917012 used, 13305968 buff/cache
KiB Swap:  5111804 total,  5111636 free,      168 used. 16058184 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21250 solr      20   0 23.599g 1.184g 228440 S   2.0  6.1  59:55.91 java



*Solr is running on 5 machines with similar configuration:*

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Stepping:              4
CPU MHz:               2799.033
BogoMIPS:              5600.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-3



            Reporter: RAHAT BHALLA


We host a Solr Cloud of 5 Solr nodes, with 3 ZooKeeper nodes to maintain the 
cloud. We have over 70 million documents spread across 13 collections, with 
about 40K more documents added every day in near real time, at intervals of 
5 to 6 minutes.
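
To put the indexing load in perspective, the figures above can be turned into a rough per-batch estimate. This is only a sketch: it assumes the batches are spread evenly around the clock, which may not match the real schedule, and the function name is made up for illustration.

```python
# Rough estimate of documents per indexing batch.
# Assumption (not measured): batches are evenly spaced over 24 hours.
MINUTES_PER_DAY = 24 * 60
DOCS_PER_DAY = 40_000  # daily volume quoted in the report


def docs_per_batch(interval_minutes):
    """Average documents per batch for a given batch interval in minutes."""
    batches_per_day = MINUTES_PER_DAY / interval_minutes
    return DOCS_PER_DAY / batches_per_day


for interval in (5, 6):
    print(f"{interval}-minute interval: ~{round(docs_per_batch(interval))} docs per batch")
```

Under that assumption each batch carries on the order of 140 to 170 documents.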

The system was working as expected and as required for the last 7 months until 
suddenly we saw the following exception and all of our instances went offline. 
We restarted the instances and the cloud ran smoothly for three days before it 
came crashing down again.

*The exception logged before it goes down is as follows:*

3542285 ERROR (OverseerCollectionConfigSetProcessor-98221003671470081-prod-solr-node01:9080_solr-n_0000000106) [   ] o.a.s.c.OverseerTaskProcessor
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /overseer_elect/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:348)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:345)
        at org.apache.solr.cloud.OverseerTaskProcessor.amILeader(OverseerTaskProcessor.java:384)
        at org.apache.solr.cloud.OverseerTaskProcessor.run(OverseerTaskProcessor.java:191)
        at java.lang.Thread.run(Unknown Source)
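
For readers unfamiliar with the frames above: the read of /overseer_elect/leader goes through a retry wrapper (ZkCmdExecutor.retryOperation), and the ConnectionLoss error was still logged. The following is a minimal Python sketch of the general retry-on-connection-loss pattern, not Solr's actual code. ConnectionLoss, retry_on_connection_loss, and the retry count and delay are hypothetical names and values chosen for illustration.

```python
import time


class ConnectionLoss(Exception):
    """Hypothetical stand-in for ZooKeeper's ConnectionLossException."""


def retry_on_connection_loss(operation, retries=3, delay_s=0.5):
    """Call operation(); on ConnectionLoss, wait and try again.

    Makes at most retries + 1 attempts. The wait grows linearly with the
    attempt number. If every attempt fails, the last error is re-raised,
    which is the point at which a caller would see the failure.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return operation()
        except ConnectionLoss as err:
            last_error = err
            if attempt < retries:
                time.sleep(delay_s * (attempt + 1))
    raise last_error
```

A wrapper like this only hides short interruptions. If the connection stays down longer than the total retry window, the error reaches the caller, which would be consistent with what the log shows.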



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
