On Sep 17, 2013, at 12:00 PM, Vladimir Veljkovic 
<vladimir.veljko...@boxalino.com> wrote:

> Hello there,
> 
> We have the following setup:
> 
> SolrCloud 4.4.0 (3 nodes, physical machines)
> Zookeeper 3.4.5 (3 nodes, physical machines)
> 
> We have a number of rather small collections (~10K or ~100K documents) 
> that we would like to load onto all Solr instances (numShards=1, 
> replication_factor=3) and access through the local network interface, as 
> load balancing is done in the layers above.
> 
> We can live with (and in the test phase actually do) reloading entire 
> collections whenever needed, then switching collection aliases and 
> removing the old collections.
> 
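A rough sketch of the reload-and-swap workflow described above, assuming the Solr Collections API actions CREATE, CREATEALIAS and DELETE. The base URL, the alias and the collection names are made up for illustration, and the Collections API spells the replica count `replicationFactor`. The code only builds the request URLs and does not send anything:

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


def alias_swap_urls(base_url, alias, new_collection, old_collection):
    """Return the three requests, in order, for a full reload behind an alias.

    Assumption: the new collection is indexed between step 1 and step 2,
    and a suitable config set is already available in ZooKeeper.
    """
    return [
        # 1. Create the new collection: one shard, a replica on each of 3 nodes.
        collections_api_url(base_url, "CREATE", name=new_collection,
                            numShards=1, replicationFactor=3),
        # 2. Point the alias at the new collection once it has been loaded.
        collections_api_url(base_url, "CREATEALIAS", name=alias,
                            collections=new_collection),
        # 3. Drop the old collection, which the alias no longer references.
        collections_api_url(base_url, "DELETE", name=old_collection),
    ]
```

Clients that query the alias rather than a concrete collection name should not notice the swap in step 2.
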
> We stumbled across the following problem: as soon as each of the three Solr 
> nodes has become leader for at least one collection, restarting any node 
> makes it completely unresponsive (timeout), both through the admin interface 
> and for replication. If we restart all Solr nodes, the cluster ends up in 
> some kind of deadlock, and the only remedy we have found is a clean Solr 
> installation, removing the ZooKeeper data and re-posting the collections.
> 
> Apparently, the leader is waiting for the replicas to come up, and they try 
> to synchronize but time out on HTTP requests, so everything ends up in some 
> kind of deadlock, maybe related to:
> 
> https://issues.apache.org/jira/browse/SOLR-5240

Yup, that sounds exactly like what you would expect with SOLR-5240. A fix for 
that is coming in 4.5, which is probably a week or so away.

> 
> Eventually (after a few minutes), the leader takes over and marks the 
> collections "active" but remains blocked on the HTTP interface, so the other 
> nodes cannot synchronize.
> 
> In further tests, we loaded 4 collections with numShards=1 and 
> replication_factor=2. By chance, one node became the leader for all 4 
> collections. Restarting the node that was not the leader went without 
> problems, but when we restarted the leader, the following happened:
> - the leader shut down; the other nodes became leaders of 2 collections each
> - the leader started up; 3 collections on it became "active", one collection 
> remained "down", and the node became unresponsive, timing out on HTTP requests.
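
For anyone trying to reproduce the leader distribution described above, here is a minimal sketch that tallies leaders per node and lists replicas that are not "active". It assumes the cluster state has already been read from ZooKeeper and parsed into a dict, and that it follows the 4.x clusterstate.json layout (collection, then "shards", then "replicas", with "node_name", "state" and "leader" on each replica). The function names are hypothetical:

```python
from collections import Counter


def leaders_per_node(cluster_state):
    """Count how many shard leaders each node currently holds."""
    counts = Counter()
    for collection in cluster_state.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                # Assumption: leaders are flagged with the string "true".
                if replica.get("leader") == "true":
                    counts[replica.get("node_name")] += 1
    return counts


def non_active_replicas(cluster_state):
    """List (collection, shard, node_name) for replicas not in state "active"."""
    found = []
    for coll_name, collection in cluster_state.items():
        for shard_name, shard in collection.get("shards", {}).items():
            for replica in shard.get("replicas", {}).values():
                if replica.get("state") != "active":
                    found.append((coll_name, shard_name,
                                  replica.get("node_name")))
    return sorted(found)
```

A node holding every leader, as in the second test above, would show up as the only key in the result of `leaders_per_node`.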

Hard to say - I'll experiment with 4.5 and see if I can duplicate this.

- Mark

> 
> As this behavior is completely unexpected for a cluster solution, I wonder 
> whether anybody else has experienced the same problems or whether we are 
> doing something entirely wrong.
> 
> Best regards
> 
> -- 
> 
> Vladimir Veljkovic
> Senior Java Developer
> 
> Boxalino AG
> 
> vladimir.veljko...@boxalino.com 
> www.boxalino.com 
> 
> 
> Tuning Kit for your Online Shop
> 
> Product Search - Recommendations - Landing Pages - Data intelligence - Mobile 
> Commerce 
> 
> 
