[jira] [Updated] (SOLR-6056) Zookeeper crash JVM stack OOM because of recover strategy

2014-05-14 Thread Raintung Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raintung Li updated SOLR-6056:
--

Description: 
Some errors, such as "org.apache.solr.common.SolrException: Error opening new
searcher. exceeded limit of maxWarmingSearchers=2, try again later", that occur
in DistributedUpdateProcessor trigger the core admin recovery process.
That means every update request that hits such an error sends a core admin
recovery request (see doFinish() in DistributedUpdateProcessor.java).

The terrible thing is that CoreAdminHandler starts a new thread for each
request to publish the recovery status and start recovery. Threads increase
very quickly and cause a stack OOM. The Overseer can't handle that many status
updates: the ZooKeeper nodes under /overseer/queue (for example
/overseer/queue/qn-125553) increased by more than 40 thousand in two minutes.

In the end ZooKeeper crashed.
Worse, with so many nodes queued in ZooKeeper, the cluster can't publish the
correct status, because only one Overseer is working. I had to start three
threads to clear the queue nodes. The cluster did not work normally for nearly
30 minutes.
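
The thread growth described above could in principle be avoided by allowing at
most one recovery per core at a time and running recoveries on a bounded pool.
The Java sketch below only illustrates that idea. RecoveryGuard,
requestRecovery and the pool size are hypothetical names chosen for this
example; this is not Solr's code and not necessarily what patch-6056.txt does.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper, not part of Solr: coalesces repeated recovery
// requests for the same core and bounds the number of recovery threads.
final class RecoveryGuard {
    // Names of cores whose recovery has been accepted and not yet finished.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();
    // Fixed-size pool, so a burst of requests cannot create unbounded threads.
    private final ExecutorService pool;

    RecoveryGuard(int maxRecoveryThreads) {
        this.pool = Executors.newFixedThreadPool(maxRecoveryThreads);
    }

    // Returns true if a recovery was scheduled, false if one is already
    // pending or running for this core (the duplicate request is dropped).
    boolean requestRecovery(String coreName, Runnable recovery) {
        if (!inFlight.add(coreName)) {
            return false;
        }
        try {
            pool.execute(() -> {
                try {
                    recovery.run();
                } finally {
                    // Allow a later recovery for this core once this one ends.
                    inFlight.remove(coreName);
                }
            });
        } catch (RuntimeException e) {
            // The pool rejected the task (for example after shutdown).
            inFlight.remove(coreName);
            throw e;
        }
        return true;
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

With a guard like this, a thousand failing updates against the same core would
schedule one recovery instead of a thousand threads, and the status would be
published once per recovery rather than once per failed request.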







 Zookeeper crash JVM stack OOM because of recover strategy 
 --

 Key: SOLR-6056
 URL: https://issues.apache.org/jira/browse/SOLR-6056
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.6
 Environment: Two linux servers, 65G memory, 16 core cpu
 20 collections, every collection has one shard two replica 
 one zookeeper
Reporter: Raintung Li
Priority: Critical
  Labels: cluster, crash, recover




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-6056) Zookeeper crash JVM stack OOM because of recover strategy

2014-05-13 Thread Raintung Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raintung Li updated SOLR-6056:
--

Priority: Critical  (was: Major)







[jira] [Updated] (SOLR-6056) Zookeeper crash JVM stack OOM because of recover strategy

2014-05-11 Thread Raintung Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raintung Li updated SOLR-6056:
--

Environment: 
Two linux servers, 65G memory, 16 core cpu
20 collections, every collection has one shard two replica 
one zookeeper

  was:
Two linux server, 65G, 16 core cup
20 collections, every collection has one shard two replica 
one zookeeper








[jira] [Updated] (SOLR-6056) Zookeeper crash JVM stack OOM because of recover strategy

2014-05-11 Thread Raintung Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raintung Li updated SOLR-6056:
--

Attachment: patch-6056.txt



