> This is going to push SolrCloud beyond its limits. Is this just an
> exercise to see how far you can push Solr, or are you looking at setting
> up a production install with several thousand collections?

I'm looking towards production.

> In Solr 4.x, the clusterstate is one giant JSON structure containing the
> state of the entire cloud. With 5000 collections, the entire thing
> would need to be downloaded and uploaded at least 5000 times during the
> course of a successful full system startup ... and I think with
> replicationFactor set to 2, that might actually be 10000 times. The
> best-case scenario is that it would take a VERY long time, the
> worst-case scenario is that concurrency problems would lead to a
> deadlock. A deadlock might be what is happening here.

Yes, clusterstate.json is 3.3M. At times on startup I think it does deadlock; log shows after 1min:

org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state.

> In Solr 5.x, the clusterstate is broken up so there's a separate state
> structure for each collection. This setup allows for faster and safer
> multi-threading and far less data transfer. Assuming I understand the
> implications correctly, there might not be any need to increase
> jute.maxbuffer with 5.x ... although I have to assume that I might be
> wrong about that.
>
> I would very much recommend that you set your scenario up from scratch
> in Solr 5.0.0, to see if the new clusterstate format can eliminate the
> problem you're seeing. If it doesn't, then we can pursue it as a likely
> bug in the 5.x branch and you can file an issue in Jira.

Thanks, will test in Solr 5.0.0.
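
For a rough sense of scale on the 4.x startup cost described above, here is a back-of-envelope sketch. It assumes "3.3M" means roughly 3.3 MB, that the state stays about that size for the whole startup, and that each update costs one full download plus one full upload. The function name is made up for illustration, and the result is only an order-of-magnitude guess, not a measurement.

```python
# Back-of-envelope estimate only; inputs come from the thread or are assumptions.
CLUSTERSTATE_BYTES = 3_300_000   # "3.3M" in the thread, read as ~3.3 MB (assumption)
COLLECTIONS = 5000               # from the thread
REPLICATION_FACTOR = 2           # from the thread


def startup_transfer_bytes(state_bytes: int, updates: int) -> int:
    """Bytes moved if every update downloads and re-uploads the full state."""
    return state_bytes * updates * 2  # one download + one upload per update


# "at least 5000 times" versus "might actually be 10000 times"
low = startup_transfer_bytes(CLUSTERSTATE_BYTES, COLLECTIONS)
high = startup_transfer_bytes(CLUSTERSTATE_BYTES, COLLECTIONS * REPLICATION_FACTOR)

print(f"roughly {low / 1e9:.0f} GB to {high / 1e9:.0f} GB through ZooKeeper")
```

Under those assumptions the single shared clusterstate implies tens of gigabytes of traffic for one full restart, which fits with the timeout in the log.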