Oh, and I was wondering if 'leaderVoteWait' might help in Solr4.

On 27 February 2015 at 18:04, Damien Kamerman <dami...@gmail.com> wrote:

>> This is going to push SolrCloud beyond its limits. Is this just an
>> exercise to see how far you can push Solr, or are you looking at setting
>> up a production install with several thousand collections?
>>
> I'm looking towards production.
>
>> In Solr 4.x, the clusterstate is one giant JSON structure containing the
>> state of the entire cloud. With 5000 collections, the entire thing
>> would need to be downloaded and uploaded at least 5000 times during the
>> course of a successful full system startup ... and I think with
>> replicationFactor set to 2, that might actually be 10000 times. The
>> best-case scenario is that it would take a VERY long time, the
>> worst-case scenario is that concurrency problems would lead to a
>> deadlock. A deadlock might be what is happening here.
>>
> Yes, clusterstate.json is 3.3M. At times on startup I think it does
> deadlock; log shows after 1min:
> org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes
> published as DOWN in our cluster state.
>
>> In Solr 5.x, the clusterstate is broken up so there's a separate state
>> structure for each collection. This setup allows for faster and safer
>> multi-threading and far less data transfer. Assuming I understand the
>> implications correctly, there might not be any need to increase
>> jute.maxbuffer with 5.x ... although I have to assume that I might be
>> wrong about that.
>>
>> I would very much recommend that you set your scenario up from scratch
>> in Solr 5.0.0, to see if the new clusterstate format can eliminate the
>> problem you're seeing. If it doesn't, then we can pursue it as a likely
>> bug in the 5.x branch and you can file an issue in Jira.
>>
> Thanks, will test in Solr 5.0.0.
>

--
Damien Kamerman
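
P.S. For a rough sense of scale, here is a back-of-the-envelope sketch of the clusterstate traffic described in the quoted text. It assumes "3.3M" means roughly 3.3 MB, that the file stays about that size for the whole startup, and that every state change costs one full download plus one full upload. The function name and parameters are made up for illustration; this is not Solr or ZooKeeper code.

```python
def startup_transfer_mb(state_size_mb, collections, replication_factor,
                        round_trip=True):
    """Estimate clusterstate traffic (in MB) for one full cluster startup.

    Assumption: each replica coming up triggers one read and one write
    of the whole clusterstate file, as suggested in the quoted message.
    """
    updates = collections * replication_factor
    per_update_mb = state_size_mb * (2 if round_trip else 1)
    return updates * per_update_mb


if __name__ == "__main__":
    # Figures from this thread: 3.3 MB file, 5000 collections, replicationFactor 2.
    total_mb = startup_transfer_mb(3.3, 5000, 2)
    print(f"approx. {total_mb / 1000:.0f} GB moved through ZooKeeper")
```

If those assumptions hold even loosely, the total runs to tens of gigabytes for a single restart, which would fit the very slow startups I am seeing.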