I've run into an issue starting my SolrCloud cluster with many collections.
My setup is:
3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
server (256GB RAM).
5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
1 x Zookeeper 3.4.6
Java arg -Djute.maxbuffer=67108864 (64 MB) added to both Solr and ZK.

I then stop all nodes and start them again. All replicas are in the down
state, and some shards have no leader. At times I have seen a few (12 or so)
leaders in the active state. In the Solr logs I see lots of:

org.apache.solr.cloud.ZkController; Still seeing conflicting information
about the leader of shard shard1 for collection DDDDDD-4351 after 30
seconds; our state says http://ftea1:8001/solr/DDDDDD-4351_shard1_replica1/,
but ZooKeeper says http://ftea1:8000/solr/DDDDDD-4351_shard1_replica2/

org.apache.solr.common.SolrException;
:org.apache.solr.common.SolrException: Error getting leader from zk for
shard shard1
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:910)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:822)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:770)
        at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:221)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: There is conflicting
information about the leader of shard: shard1 our state says:
http://ftea1:8001/solr/DDDDDD-1564_shard1_replica2/ but zookeeper says:
http://ftea1:8000/solr/DDDDDD-1564_shard1_replica1/
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:889)
        ... 6 more
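
To put numbers on the symptoms above (how many replicas are down, which
shards have no leader), something like the following could be run against a
copy of the cluster state. This is only a minimal sketch: it assumes the
Solr 4.x layout in which /clusterstate.json maps collection -> "shards" ->
shard -> "replicas" -> replica, with a "state" field on each replica and
"leader": "true" on the leader. The summarize function and the SAMPLE data
are hypothetical and made up for illustration; they are not taken from my
cluster.

```python
import json
from collections import Counter


def summarize(cluster_state):
    """Return (replica state counts, list of (collection, shard) with no leader).

    Assumes the Solr 4.x clusterstate.json layout described in the lead-in.
    """
    states = Counter()
    leaderless = []
    for coll_name, coll in cluster_state.items():
        for shard_name, shard in coll.get("shards", {}).items():
            has_leader = False
            for replica in shard.get("replicas", {}).values():
                states[replica.get("state", "unknown")] += 1
                # The leader flag is assumed to be stored as the string "true".
                if replica.get("leader") == "true":
                    has_leader = True
            if not has_leader:
                leaderless.append((coll_name, shard_name))
    return states, leaderless


# Hypothetical two-collection sample, not real cluster data.
SAMPLE = json.loads("""
{
  "DDDDDD-0001": {"shards": {"shard1": {"replicas": {
    "core_node1": {"state": "active", "leader": "true"},
    "core_node2": {"state": "down"}}}}},
  "DDDDDD-0002": {"shards": {"shard1": {"replicas": {
    "core_node1": {"state": "down"},
    "core_node2": {"state": "down"}}}}}
}
""")

if __name__ == "__main__":
    states, leaderless = summarize(SAMPLE)
    print(dict(states))
    print(leaderless)
```

In practice the input would be the JSON fetched from ZooKeeper and loaded
with json.load, rather than the inline sample.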

I've tried staggering the node starts (1 min apart), but that does not help.
I've reproduced with zero documents.
Restarts are OK up to around 3,000 cores.
Should this work?

Damien.
