On Sep 17, 2013, at 12:00 PM, Vladimir Veljkovic <vladimir.veljko...@boxalino.com> wrote:
> Hello there,
>
> we have the following setup:
>
> SolrCloud 4.4.0 (3 nodes, physical machines)
> Zookeeper 3.4.5 (3 nodes, physical machines)
>
> We have a number of rather small collections (~10K or ~100K documents)
> that we would like to load onto all Solr instances (numShards=1,
> replication_factor=3) and access through the local network interface, as
> the load balancing is done in layers above.
>
> We can live (and we actually do it in the test phase) with updating the
> entire collections whenever we need to, switching collection aliases and
> removing the old collections.
>
> We stumbled across the following problem: as soon as each of the three Solr
> nodes is the leader of at least one collection, restarting any node makes it
> completely unresponsive (timeout), both through the admin interface and for
> replication. If we restart all Solr nodes, the cluster ends up in some kind
> of deadlock, and the only remedy we found is a clean Solr installation,
> removing the ZooKeeper data and re-posting the collections.
>
> Apparently, the leader is waiting for replicas to come up and they try to
> synchronize but time out on HTTP requests, so everything ends up in some
> kind of deadlock, maybe related to:
>
> https://issues.apache.org/jira/browse/SOLR-5240

Yup, that sounds exactly like what you would expect with SOLR-5240. A fix for
that is coming in 4.5, which is probably a week or so away.

>
> Eventually (after a few minutes), the leader takes over and marks the
> collections "active" but remains blocked on the HTTP interface, so the other
> nodes cannot synchronize.
>
> In further tests, we loaded 4 collections with numShards=1 and
> replication_factor=2. By chance, one node became the leader for all 4
> collections. Restarting the node which was not the leader went without
> a problem, but when we restarted the leader, the following happened:
> - leader shut down, other nodes became leaders of 2 collections each
> - leader starts up, 3 collections on it become "active", one collection
>   remains "down", and the node becomes unresponsive and times out on HTTP
>   requests.

Hard to say - I'll experiment with 4.5 and see if I can duplicate this.

- Mark

>
> As this behavior is completely unexpected for a cluster solution, I wonder
> if somebody else has experienced the same problems or whether we are doing
> something entirely wrong.
>
> Best regards
>
> --
>
> Vladimir Veljkovic
> Senior Java Entwickler
>
> Boxalino AG
>
> vladimir.veljko...@boxalino.com
> www.boxalino.com
>
> Tuning Kit for your Online Shop
>
> Product Search - Recommendations - Landing Pages - Data intelligence - Mobile
> Commerce
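
For anyone trying to reproduce the setup described in the quoted message, the reload cycle (create a fresh collection, repoint the alias, remove the old collection) could be sketched roughly as below. This is only an illustration under assumptions, not the poster's actual tooling: the base URL, the alias and collection names, and the helper functions are hypothetical. The sketch uses the Collections API parameter `replicationFactor` where the message writes `replication_factor`, omits the config set name and the indexing step between CREATE and CREATEALIAS, and only builds the request URLs without sending them.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host and port for your own Solr node.
SOLR_BASE = "http://localhost:8983/solr"


def collections_url(action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{SOLR_BASE}/admin/collections?{query}"


def reload_plan(alias, new_name, old_name, replicas=3):
    """Return the ordered URLs for: create new, repoint alias, delete old."""
    return [
        # One shard, replicated to every node, as in the described setup.
        collections_url("CREATE", name=new_name, numShards=1,
                        replicationFactor=replicas),
        # Point the alias at the freshly loaded collection.
        collections_url("CREATEALIAS", name=alias, collections=new_name),
        # Drop the collection the alias used to point at.
        collections_url("DELETE", name=old_name),
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for url in reload_plan("products", "products_v2", "products_v1"):
        print(url)
```

Building the URLs separately from sending them makes it easy to log each step while a node is being restarted, which may help narrow down where the timeouts start.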