[jira] [Updated] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

Steve Rowe (JIRA) Wed, 12 Apr 2017 07:40:22 -0700

     [ https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Steve Rowe updated SOLR-10420:
------------------------------
    Attachment: OverseerTest.80.stdout

I ran all Solr tests with the patch on master, and one test failed: 

{noformat}
   [junit4]   2> 264992 ERROR (OverseerExitThread) [    ] o.a.s.c.Overseer could not read the data
   [junit4]   2> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
   [junit4]   2>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
   [junit4]   2>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
   [junit4]   2>        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
   [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:356)
   [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:353)
   [junit4]   2>        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
   [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:353)
   [junit4]   2>        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:290)
   [junit4]   2>        at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=OverseerTest -Dtests.method=testExternalClusterStateChangeBehavior -Dtests.seed=2110CE0AEF674CFA -Dtests.slow=true -Dtests.locale=es-GT -Dtests.timezone=Asia/Kolkata -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 5.46s J12 | OverseerTest.testExternalClusterStateChangeBehavior <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: Illegal state, was: down expected:active clusterState:live nodes:[]collections:{c1=DocCollection(c1//clusterstate.json/2)={
   [junit4]    >   "shards":{"shard1":{
   [junit4]    >       "parent":null,
   [junit4]    >       "range":null,
   [junit4]    >       "state":"active",
   [junit4]    >       "replicas":{"core_node1":{
   [junit4]    >           "base_url":"http://127.0.0.1/solr",
   [junit4]    >           "node_name":"node1",
   [junit4]    >           "core":"core1",
   [junit4]    >           "roles":"",
   [junit4]    >           "state":"down"}}}},
   [junit4]    >   "router":{"name":"implicit"}}, test=LazyCollectionRef(test)}
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([2110CE0AEF674CFA:490ECDE60DF716B4]:0)
   [junit4]    >        at org.apache.solr.cloud.AbstractDistribZkTestBase.verifyReplicaStatus(AbstractDistribZkTestBase.java:273)
   [junit4]    >        at org.apache.solr.cloud.OverseerTest.testExternalClusterStateChangeBehavior(OverseerTest.java:1259)
{noformat}

I ran the repro line a couple of times and it didn't reproduce.  I then beasted 
100 iterations of the test suite using Miller's beasting script, and it failed 
once.  I'm attaching the test log from the failure.

Looking at emailed Jenkins reports of 
{{testExternalClusterStateChangeBehavior()}} failing, I see that it was failing 
almost daily until the day SOLR-9191 was committed (June 9, 2016), and has had 
zero failures since. This new failure therefore seems suspicious to me, since 
this issue is related to SOLR-9191.

I beasted 200 iterations of OverseerTest without the patch, and got zero 
failures.
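
As a rough sanity check on these numbers (a sketch only, not part of the test runs above): if the unpatched code had the same 1-in-100 per-run failure rate seen with the patch, and runs were independent, the chance of 200 clean runs in a row can be estimated as below. Both the rate and the independence of runs are assumptions, and the function name is hypothetical.

```python
def prob_zero_failures(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass,
    given a fixed per-run failure rate."""
    return (1.0 - failure_rate) ** runs


# Assumption: the 1-in-100 rate observed with the patch also holds without it.
assumed_rate = 1 / 100
p = prob_zero_failures(assumed_rate, 200)
print(f"P(0 failures in 200 runs at rate {assumed_rate:.2%}) = {p:.3f}")
```

Under those assumptions the result is roughly 13%, so 200 clean runs without the patch are suggestive but do not by themselves rule out a pre-existing low-rate failure.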

> Solr 6.x leaking one SolrZkClient instance per second
> -----------------------------------------------------
>
>                 Key: SOLR-10420
>                 URL: https://issues.apache.org/jira/browse/SOLR-10420
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 6.5, 6.4.2
>            Reporter: Markus Jelsma
>             Fix For: master (7.0), branch_6x
>
>         Attachments: OverseerTest.80.stdout, SOLR-10420.patch
>
>
> One of our nodes went berserk after a restart; Solr went completely nuts! 
> So I opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one-per-second rate is quite accurate for 
> all nodes: there are about as many instances as there are seconds 
> since Solr started. I know VisualVM's instance count includes 
> objects-to-be-collected, but the instance count does not drop after a forced 
> garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
