Yes, I have the same problem.

2012/10/5 Kyryl Bilokurov <kyryl.biloku...@gmail.com>

> Hi,
>
> I have a functional/performance test SolrCloud cluster (using Solr
> 4.0-BETA) with the following setup: 4 servers, each hosting 1/4 of
> the collection (no replicas, so each shard has only a leader).
> The current ZK client timeout is set to 15 seconds. From time to time I see
> Solr's ZK client connection time out:
>
> ======
> INFO: Client session timed out, have not heard from server in 19105ms for sessionid 0x3388fcec9490677, closing socket connection and attempting reconnect
> ======
>
> The reconnect is triggered, but after it the shard enters a bad state,
> because it cannot get the leader props for an extended period of
> time:
>
> ======
> INFO: Updating cluster state from ZooKeeper...
> Oct 3, 2012 4:07:20 AM org.apache.solr.common.cloud.ZkStateReader$2 process
> INFO: A cluster state change has occurred - updating...
> Oct 3, 2012 4:07:50 AM org.apache.solr.common.SolrException log
> SEVERE: There was a problem finding the leader in zk:java.lang.RuntimeException: Could not get leader props
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:640)
>         at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1031)
>         at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:233)
>         at org.apache.solr.cloud.ZkController.access$300(ZkController.java:77)
>         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:180)
>         at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:101)
>         at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:47)
>         at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:85)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)
> ....
> ... the same message and stack trace repeat every ~30 seconds, until it
> changes to
> ...
> Oct 3, 2012 4:20:09 AM org.apache.solr.common.SolrException log
> SEVERE: :org.apache.solr.common.SolrException: There was a problem finding the leader in zk
>         at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1041)
>         at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:233)
>         at org.apache.solr.cloud.ZkController.access$300(ZkController.java:77)
>         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:180)
>         at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:101)
>         at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:47)
>         at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:85)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)
>
> Oct 3, 2012 4:20:09 AM org.apache.solr.cloud.ZkController createEphemeralLiveNode
> INFO: Register node as live in ZooKeeper:/live_nodes/host.domain:18100_solr
> Oct 3, 2012 4:20:09 AM org.apache.solr.common.cloud.SolrZkClient makePath
> INFO: makePath: /live_nodes/host.domain:18100_solr
> ...
> ... at this point the cluster seems to be OK for some time.
> ======
>
> This looks a bit similar to SOLR-3274, as it is also triggered by an
> expired ZK connection and results in "No servers hosting shard" search
> errors.
>
> For now, I have increased the timeout to 30 seconds, as suggested in
> SOLR-3274, to lower the probability of ZK timeouts, but shouldn't the
> cluster heal faster than in 15 minutes? As there is only one server hosting
> each shard, it could become the leader instantly.
>
> Thanks,
> Kyryl
>
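
To put a number on how long the shard stayed without a leader, the two timestamps in the quoted log can be subtracted directly. Below is a minimal Java sketch, assuming the timestamps follow the "Oct 3, 2012 4:07:20 AM" format shown above; the class and method names (OutageWindow, between) are made up for illustration and are not part of Solr or ZooKeeper.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class OutageWindow {
    // Assumed timestamp layout, taken from the quoted log lines,
    // e.g. "Oct 3, 2012 4:07:20 AM".
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("MMM d, yyyy h:mm:ss a", Locale.ENGLISH);

    // Returns the time elapsed between two log timestamps.
    static Duration between(String start, String end) {
        LocalDateTime from = LocalDateTime.parse(start, LOG_TIME);
        LocalDateTime to = LocalDateTime.parse(end, LOG_TIME);
        return Duration.between(from, to);
    }

    public static void main(String[] args) {
        // First "cluster state change" line vs. the "Register node as live" line.
        Duration gap = between("Oct 3, 2012 4:07:20 AM", "Oct 3, 2012 4:20:09 AM");
        System.out.println(gap.getSeconds() + " seconds");
    }
}
```

For the excerpt above this gives 769 seconds, i.e. close to 13 minutes between the cluster state update and the node re-registering as live.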
