A couple of questions/suggestions:
- This normally happens after a leader election: when a new leader gets
elected, it forces all the nodes to sync with itself.
Check the logs to see whether the leader changed when this happened. If it
did, you will have to investigate why the leader change takes place.
I suspect the leader goes into a GC pause long enough that ZooKeeper decides
the leader is no longer available and initiates a leader election.

- What version of Solr are you using? SOLR-8586
<https://issues.apache.org/jira/browse/SOLR-8586> introduced the
IndexFingerprint check. Unfortunately it was broken, so a replica would
always do a full index replication. The issue is now fixed in SOLR-9310
<https://issues.apache.org/jira/browse/SOLR-9310>, which should help
replicas recover faster.

- You should also increase the update log (ulog) size (the default threshold
is 100 docs or 10 tlogs, whichever is hit first). This will also help
replicas recover faster from tlogs (of course, there is a threshold beyond
which recovering from the tlog would in fact take longer than copying all
the index files over from the leader).
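
For reference, a rough sketch of where those limits live in solrconfig.xml.
The element names numRecordsToKeep and maxNumLogsToKeep are the standard
updateLog settings; the values 10000 and 100 are only illustrative
assumptions, not recommendations, and should be tuned against your indexing
rate and how long a replica is typically out of sync.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <!-- Defaults are 100 records and 10 logs; the values below are
         illustrative assumptions only. -->
    <int name="numRecordsToKeep">10000</int>
    <int name="maxNumLogsToKeep">100</int>
  </updateLog>
</updateHandler>
```

Larger values let a replica catch up from the tlog after missing more
updates, at the cost of more disk space and longer tlog replay on startup.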

On Thu, Oct 6, 2016 at 5:23 AM, Gerald Reinhart <gerald.reinh...@kelkoo.com> wrote:

> Hello everyone,
>     Our Solr Cloud has worked very well for several months without any
> significant changes: the traffic to serve is stable, no major release
> deployed...
>     But randomly, the Solr Cloud leader puts all the replicas in recovery
> at the same time for no obvious reason.
>     Hence, we cannot serve queries any more, and the leader is
> overloaded while replicating all the indexes to the replicas at the same
> time, which eventually results in a downtime of approximately 30 minutes.
>     Is there a way to prevent it? Ideally, a configuration setting that
> limits the percentage of replicas put in recovery at the same time?
> Thanks,
> Gérald, Elodie and Ludovic
